NYC TLC Taxi Trip Data Analysis: Tipping Behavior¶
by Hans Darmawan - JCDS2602
Background¶
Taxi transportation plays a key role in urban mobility by offering people a simple means of traveling within city limits. It is essential to public transport as it provides convenience, speed, and accessibility unlike anything else. Taxis also help keep public transportation from getting backed up in areas with too many people, and they are a wonderful option for people who do not have their cars (Kumar & Reddy, 2020).
The New York City TLC (Taxi and Limousine Commission) is responsible for regulating taxis and for-hire vehicles in New York City. Founded in 1937, the NYC TLC’s purpose is to ensure safety, accessibility, and quality of service in these industries. They are in charge of licensing drivers and vehicles, making sure that commission rules are followed, gathering information to improve service, and making sure that customers are happy overall (NYC Taxi and Limousine Commission, n.d.).
The NYC TLC’s business procedure incorporates several major milestones, including distributing licenses, conducting business inspections, and adhering to regulations to maintain high service standard levels. The commission analyzes data for taxi rides to spot new trends and enhance service quality. By monitoring the performance metrics, the TLC for New York City can ensure that each service provider is competing within acceptable limits and that the rights of the passengers and drivers are upheld.
The competition between traditional taxis and ride-sharing services such as Uber and Lyft has increased significantly in recent years. Taxis have long been an important way to get around cities, but ride-sharing apps are convenient and often cheaper, so they have taken a large share of the market. This shift is also influencing passenger behavior, including tipping: passengers may tip differently for traditional taxis than for app-based services, reflecting the different experiences and expectations associated with each mode of transport (Cohen & Kietzmann, 2014).
Studying tipping behavior is fundamental to comprehending what drives passengers to tip taxi drivers. Understanding such relations may assist the NYC TLC in strategizing on how to improve service and increase tips for the benefit of the drivers and the passengers (Mason & Dyer, 2019).
Gap Analysis¶
Companies like Lyft and Uber in the ride-sharing industry have integrated tipping systems into their mobile applications. After completing a ride, the system prompts passengers to leave a tip, ensuring a seamless and straightforward process. This convenience encourages riders to tip more frequently, as they can easily select a percentage or enter a custom amount. Researchers have found that how people tip on ride-sharing services is affected by things like the quality of service, how friendly the driver is, and the overall ride experience (Cohen & Kietzmann, 2014). Because of this structured approach, tipping in the ride-sharing context is more consistent and easier to study.
In contrast, the tipping situation in traditional taxis is less standardized. Tipping is customary, but it's not always clear how it works, and it can differ greatly from one traveler to another. The absence of a structured system can result in inconsistent tipping behavior, as passengers may not always receive prompts to tip. Factors such as the driver's demeanor, the quality of service, and the overall experience can influence whether a passenger decides to leave a tip and how much they choose to give. Since there isn't a single way to collect or analyze this data, this variety makes it hard to figure out how people tip in the taxi business.
Comparing traditional taxis with ride-sharing services, it's clear that ride-sharing services have a more organized way of getting people to leave tips. Adding tipping options to the app makes it easier for people to leave tips more often and gives analysts useful data to work with. The taxi business, on the other hand, is not as consistent or open, which makes it hard to spot patterns and trends in how people tip. This difference shows that the NYC TLC needs to look into and study tipping behavior in taxis more deeply. Knowing these factors is important for increasing driver pay and improving service quality overall (Mason & Dyer, 2020).
Problem Statement¶
The primary issue is the lack of clarity about the factors that influence how much passengers tip in New York City taxis. It is therefore important to examine how factors such as trip distance, fare amount, and time of day affect tip amounts. Understanding these links will let the NYC TLC see patterns in passenger behavior, which can inform rules that improve service and encourage tipping. The findings could also be used to train drivers to serve customers better so that passengers are more inclined to tip. Such lessons would improve both the cab service experience and the working conditions of drivers.
Insight Questions¶
How do the tipping patterns vary by hour of the day and day of the week?
How does payment method (Cash vs. Credit Card) affect tipping behavior in NYC taxis?
How does the pickup borough influence tipping?
How does the dropoff borough influence tipping?
How does the pickup service zone influence tipping?
How does the dropoff service zone influence tipping?
What is the relationship between trip distances and tip amount?
What is the relationship between trip durations and tip amount?
What is the relationship between extra charges and tip amount?
What is the relationship between tolls amount and tip amount?
What is the relationship between congestion surcharge and tip amount?
What is the relationship between passenger count and tip amount?
Data Understanding¶
Deep exploration is needed to fully understand the data, including its structure, details, and overall quality, which makes data understanding a very important step in analyzing any dataset. It looks for groupings, outliers, and relationships in the data, all of which can greatly change the outcome of later analyses. A thorough understanding of the data therefore allows the analyst to make informed decisions about data reliability, data quality, and the methods of analysis to apply.
The NYC TLC maintains this dataset, which contains a wide range of information about taxi trips in New York City. Its attributes include pickup and drop-off times, number of passengers, trip distance, fare, and payment type. Transportation analysts, urban planners, and policymakers can use each record to learn more about how taxis work, how customers act, and how to improve service quality. The columns in the New York City TLC Trip Record Data Dictionary are explained below:
Vendor ID: This column indicates the provider of the taxi meter system. It helps identify which company supplied the technology used for the trip. The values are:
- 1: Creative Mobile Technologies, LLC
- 2: VeriFone Inc.
lpep_pickup_datetime: This column records the date and time when the taxi meter was engaged, marking the start of the trip. It provides essential information for analyzing trip patterns and demand over time. The timestamp is formatted as YYYY-MM-DD HH:MM:SS.
lpep_dropoff_datetime: This column captures the date and time when the taxi meter was disengaged, indicating the end of the trip. This information is crucial for calculating trip duration and understanding passenger flow. Similar to the pickup datetime, the format is YYYY-MM-DD HH:MM:SS.
store_and_fwd_flag: This column indicates whether the trip record was held in the vehicle's memory before being sent to the vendor ("store and forward") because the vehicle had no connection to the server. A value of "Y" means the trip was stored and forwarded, while "N" indicates it was not. This distinction is important for understanding data transmission and record accuracy.
trip_duration: This column measures the total duration of the trip in seconds. It is calculated by subtracting the pickup datetime from the dropoff datetime. Analyzing trip duration helps in assessing traffic patterns and service efficiency.
pickup_borough: This column identifies the borough where the trip began. The values include:
- 1: Manhattan
- 2: Brooklyn
- 3: Queens
- 4: The Bronx
- 5: Staten Island
- 6: Outside NYC
This information is vital for understanding geographic demand and service distribution.
dropoff_borough: This column indicates the borough where the trip ended. Similar to the pickup borough, it helps in analyzing the flow of passengers between different areas. The values are the same as those for the pickup borough.
fare_amount: This column records the total fare amount for the trip, excluding any additional charges. It is essential for understanding revenue generation and pricing strategies within the taxi industry. The fare amount is typically expressed in U.S. dollars.
congestion_surcharge: This column records the congestion surcharge applied to the trip fare, in U.S. dollars (0 when no surcharge applies). The surcharge is an additional fee charged on trips in specific areas of Manhattan. This fee is designed to reduce traffic and encourage the use of alternative transportation methods (New York City Taxi and Limousine Commission, n.d.).
Data Wrangling¶
Data wrangling, which is also called "data munging" or "data preparation," fixes common problems with the quality of data, like missing values, duplicates, outliers, and formatting mistakes (McGrath & Jonker, 2024). This process involves transforming messy or problematic data into clean datasets that are ready for analysis. It is important to keep in mind that data wrangling and data analysis are not the same thing. Data wrangling is an important step that gets data ready for more analysis and insights (Kitchin, 2014).
The Data Wrangling Process¶
The data wrangling process aims to prepare raw data for analysis. It typically involves several key steps, each designed to address specific issues related to data quality and usability. These steps include identifying data sources, standardizing data formats, correcting errors, enriching datasets with additional information, and ensuring data integrity (Pandas Documentation, 2021).
Discovering¶
The discovering phase assesses the quality of the complete dataset and identifies its sources and formats. This initial evaluation is crucial for finding issues that could affect the analysis: missing data, formatting inconsistencies, errors, bias, and outliers that might skew the results are highlighted so that they can be addressed. The findings are typically documented in a data quality report or a more technical document known as a data profiling report, which includes statistics, distributions, and other results.
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.float_format", "{:.2f}".format)
# Load the csv file
real_trips = pd.read_csv("NYC TLC Trip Record.csv")
trips = real_trips.copy()
# Create a dictionary to map column names to sample values (taken from the first row)
sample_values = dict()
for column in trips.columns:
    sample_values[column] = trips[column].head(1).values[0]
sample_values
# Check formatting inconsistencies
trips_dtypes = trips.dtypes.reset_index().rename(columns={0: "Type", "index": "Column Name"})
trips_dtypes["sample Value"] = trips_dtypes["Column Name"].map(sample_values)
trips_dtypes
| | Column Name | Type | sample Value |
|---|---|---|---|
| 0 | VendorID | int64 | 2 |
| 1 | lpep_pickup_datetime | object | 2023-01-01 00:26:10 |
| 2 | lpep_dropoff_datetime | object | 2023-01-01 00:37:11 |
| 3 | store_and_fwd_flag | object | N |
| 4 | RatecodeID | float64 | 1.00 |
| 5 | PULocationID | int64 | 166 |
| 6 | DOLocationID | int64 | 143 |
| 7 | passenger_count | float64 | 1.00 |
| 8 | trip_distance | float64 | 2.58 |
| 9 | fare_amount | float64 | 14.90 |
| 10 | extra | float64 | 1.00 |
| 11 | mta_tax | float64 | 0.50 |
| 12 | tip_amount | float64 | 4.03 |
| 13 | tolls_amount | float64 | 0.00 |
| 14 | ehail_fee | float64 | NaN |
| 15 | improvement_surcharge | float64 | 1.00 |
| 16 | total_amount | float64 | 24.18 |
| 17 | payment_type | float64 | 1.00 |
| 18 | trip_type | float64 | 1.00 |
| 19 | congestion_surcharge | float64 | 2.75 |
The dataset has 20 columns. ID columns such as VendorID, PULocationID, and DOLocationID are stored as int64, and RatecodeID, payment_type, and trip_type as float64, even though all of them are categorical identifiers rather than quantities. Treating them as numbers would allow meaningless operations such as averaging IDs, so they should be converted to a categorical type. Several columns also hold whole numbers as decimals, for example RatecodeID and passenger_count. Finally, the values in the ID columns need to be validated to ensure that there are no missing or invalid values.
The time columns, namely lpep_pickup_datetime and lpep_dropoff_datetime, currently have an object data type and should be converted to a datetime data type. Left as objects, they hinder temporal analysis such as identifying peak hours and average trip duration; once converted, analysis of trip duration and time patterns can be performed more effectively.
The column naming in this dataset shows inconsistencies, especially in the use of upper and lower case letters and naming formats. Some columns use PascalCase (such as PULocationID and DOLocationID), while others use snake_case (such as lpep_pickup_datetime and fare_amount). This inconsistency can make data processing difficult, especially when using libraries such as Pandas, where consistency in column naming is essential for easy access and manipulation of data. Therefore, it is recommended to use a consistent naming convention, such as snake_case for all columns, to make data analysis easier to read and maintain (Van Rossum, 2001).
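As a rough illustration of the naming recommendation, the sketch below converts mixed-case names to snake_case with a regular expression. The helper name `to_snake_case` and the empty demo frame are assumptions made for this example, not part of the notebook, which instead uses an explicit mapping so that abbreviations such as PU and DO can be spelled out.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Convert a PascalCase or mixed-case column name to snake_case."""
    # Insert "_" where a lowercase letter or digit meets an uppercase letter,
    # and before the last capital of an acronym that starts a new word
    # (for example between "PU" and "Location").
    with_underscores = re.sub(
        r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "_", name
    )
    return with_underscores.lower()


# Empty demo frame reusing a few column names from the table above.
demo = pd.DataFrame(columns=["VendorID", "PULocationID", "RatecodeID", "fare_amount"])
demo = demo.rename(columns=to_snake_case)
print(list(demo.columns))
```

A generic converter like this keeps abbreviations as they are (PULocationID becomes pu_location_id), which is one reason to prefer a hand-written mapping when more descriptive names are wanted.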
# Check for missing values
trips_isnull_sum = trips.isnull().sum().reset_index()
trips_isnull_sum[1] = trips_isnull_sum[0] / len(trips) * 100
columns_to_rename = {0: "Missing Value Counts", 1: "Percentage (%)", "index": "Column Name"}
trips_isnull_sum = trips_isnull_sum.rename(columns=columns_to_rename)
trips_isnull_sum = trips_isnull_sum[trips_isnull_sum["Missing Value Counts"]!=0]
trips_isnull_sum
| | Column Name | Missing Value Counts | Percentage (%) |
|---|---|---|---|
| 3 | store_and_fwd_flag | 4324 | 6.34 |
| 4 | RatecodeID | 4324 | 6.34 |
| 7 | passenger_count | 4324 | 6.34 |
| 14 | ehail_fee | 68211 | 100.00 |
| 17 | payment_type | 4324 | 6.34 |
| 18 | trip_type | 4334 | 6.35 |
| 19 | congestion_surcharge | 4324 | 6.34 |
Missing data can take three forms: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). The missing values in store_and_fwd_flag, RatecodeID, passenger_count, payment_type, and congestion_surcharge are regarded as MAR because they are likely associated with other variables in the dataset, such as the state of the network during data gathering.
The trip_type column has almost the same missing rate (6.35% versus 6.34%, a difference of 10 rows). It may be classified as MNAR if the additional missing values relate to specific unrecorded trip characteristics. The ehail_fee column is treated here as MCAR: it is empty in every row, so its missingness shows no systematic relationship with any other value.
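One way to probe these classifications is to check whether the gaps in different columns fall on the same rows. The sketch below does this on a small made-up frame; the values are illustrative only and the variable names are assumptions for this example. Coinciding gaps are consistent with a shared cause, but they do not by themselves prove a particular missingness mechanism.

```python
import numpy as np
import pandas as pd

# Toy data (made up): three columns that go missing on the same rows,
# plus one column that is empty everywhere, mimicking ehail_fee.
toy = pd.DataFrame(
    {
        "RatecodeID": [1.0, np.nan, 1.0, np.nan, 5.0],
        "passenger_count": [1.0, np.nan, 2.0, np.nan, 1.0],
        "payment_type": [1.0, np.nan, 2.0, np.nan, 1.0],
        "ehail_fee": [np.nan] * 5,
    }
)

missing = toy.isna()

# Columns that are entirely empty say nothing about any mechanism.
fully_missing = [column for column in missing.columns if missing[column].all()]

# For the partially missing columns, check whether a row that has one gap
# has gaps in all of them.
partial = missing.drop(columns=fully_missing)
rows_with_any_gap = partial.any(axis=1)
rows_with_all_gaps = partial.all(axis=1)
gaps_coincide = bool((rows_with_any_gap == rows_with_all_gaps).all())

print(fully_missing, gaps_coincide)
```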
trips.duplicated().sum()
0
Each observation should appear only once in the dataset. Duplicate rows distort basic statistics such as counts, sums, and averages, because repeated trips are weighted more than once. The check above returns 0, so the dataset contains no fully duplicated rows, and later aggregations and groupings will not double-count trips. This supports reliable interpretation and conclusions that can stand up to scrutiny.
import matplotlib.pyplot as plt
# Check outliers in numeric columns, skipping the empty ehail_fee and the ID/type columns
columns_to_check = [column for column in trips.columns if trips[column].dtype in ["float64", "int64"] and column!="ehail_fee" and "ID" not in column and "type" not in column]
# Create boxplots for each column in separate subplots (2 x 5 grid)
rows = 2
cols = 5
# Create a figure with specified size
fig, axs = plt.subplots(rows, cols, figsize=(18, 12))
# Flatten the axes array for easy iteration
axs = axs.flatten()
# Draw one boxplot per column
for i, column in enumerate(columns_to_check):
    trips.boxplot(column=column, ax=axs[i])
    axs[i].set_title(column.replace("_", " ").title().replace("Mta", "MTA"))
# Adjust layout
plt.tight_layout()
# Display the plot
plt.show()
Outlier management is one of the most difficult issues in data analysis, because outliers can heavily distort summary statistics and predictive models. Iglewicz and Hoaglin (1993) describe strategies for identifying outliers and for handling them through removal, transformation, or substitution with more representative values.
- Passenger Count:
The boxplot shows passenger counts ranging from 0 to 9, with only a few outlier markers beyond 8. For cab trips, a passenger count above 4 is quite rare. These outliers may stem from input errors or from unusual trips. A validation check is recommended to confirm the accuracy of these values.
- Trip Distance:
The boxplot shows a notable outlier at the 10,000 mark. It is absolutely conceivable that there are longer taxi rides in NYC. Nevertheless, this figure appears highly questionable. It is critical to look over the trip data for possible errors relating to distance measurements. Also, consider eliminating or changing some of the outlier values as they may be devoid of logic.
- Fare Amount:
Any outlier above $200 may indicate an extremely long trip or an incorrect Fare. These high fares, however, can occur, depending on the situation, such as airport trips. It is critical to validate the fare and consider removing or modifying the value if it is considered invalid.
- Extra Charges:
The extra column includes additional charges, some of which exhibit exceedingly high outlier values. This could stem from a mix of charge outliers or an unusual cost with an input error. It is best to ensure that supporting documents are in check to validate these additional charges with costing policies.
- Tip Amount:
Outlier values greater than $100 suggest a rather unusual tip. While very high tip amounts do occur in this industry, it is best to double-check the accuracy of this data. Values that do not fit normal tipping patterns should be removed or adjusted.
- Total Amount:
Outlier amounts exceeding $500 indicate excessively high trip costs, likely due to long multi-journey trips or input errors. These figures should be validated, and outlier values should be removed or adjusted if they are deemed invalid after verification.
- ID Columns and Types:
VendorID, PULocationID, and DOLocationID are identifiers, not measurements, so outlier analysis does not apply to them. They are stored as integers in the raw data and should be treated as categorical values; an analysis that treats them as quantities will not be meaningful. These columns need to be converted before analysis.
- MTA Tax, Improvement Surcharge, Congestion Surcharge, and Tolls Amount:
As the boxplots show, these columns take only a few distinct values. Outlier analysis is most informative for columns with diverse value distributions (Tukey, 1977), so no meaningful conclusions about outliers can be drawn here.
- Negative Values:
Negative values in the monetary columns are illogical for a taxi business, where fares and charges should not fall below zero. They suggest data entry errors or situations where costs are greater than revenue, and they must be handled carefully so that they do not distort the analysis.
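Before deciding how to treat negative amounts, it helps to count them. The following is a minimal sketch on made-up values; the frame `toy` and the list `money_columns` are assumptions for illustration, and the same two expressions could be pointed at the monetary columns of the trip data.

```python
import pandas as pd

# Made-up amounts; two of the four rows contain a negative value.
toy = pd.DataFrame(
    {
        "fare_amount": [14.9, -3.0, 7.2, 0.0],
        "tip_amount": [4.03, 0.0, -1.0, 2.0],
        "total_amount": [24.18, -4.5, 11.64, 3.0],
    }
)
money_columns = ["fare_amount", "tip_amount", "total_amount"]

# Number of negative entries per column.
negative_counts = (toy[money_columns] < 0).sum()

# Rows where at least one monetary column is negative, for manual review.
rows_with_negatives = toy[(toy[money_columns] < 0).any(axis=1)]

print(negative_counts)
print(rows_with_negatives)
```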
Structuring¶
Structuring, also referred to as data transformation, arranges data in a logical format that can be analyzed. It includes summarizing and aggregating statistics, which combines multiple rows, grouping data by value, and pivoting rows and columns so that the layout is clearer and more functional. When the data is organized properly, analysts can examine and interpret it more easily, which improves the accuracy and reliability of insights and strategic decisions (Kelleher & Tierney, 2018).
# Renaming the columns
columns_to_rename = {
    "VendorID": "vendor_id",
    "lpep_pickup_datetime": "pickup_datetime",
    "lpep_dropoff_datetime": "dropoff_datetime",
    "store_and_fwd_flag": "store_and_fwd_flag",
    "RatecodeID": "rate_code_id",
    "PULocationID": "pickup_location_id",
    "DOLocationID": "dropoff_location_id",
    "passenger_count": "passenger_count",
    "ehail_fee": "ehail_fee",
    "payment_type": "payment_type_id",
    "trip_type": "trip_type_id",
    "congestion_surcharge": "congestion_surcharge"
}
trips = trips.rename(columns=columns_to_rename)
trips.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68211 entries, 0 to 68210
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   vendor_id              68211 non-null  int64
 1   pickup_datetime        68211 non-null  object
 2   dropoff_datetime       68211 non-null  object
 3   store_and_fwd_flag     63887 non-null  object
 4   rate_code_id           63887 non-null  float64
 5   pickup_location_id     68211 non-null  int64
 6   dropoff_location_id    68211 non-null  int64
 7   passenger_count        63887 non-null  float64
 8   trip_distance          68211 non-null  float64
 9   fare_amount            68211 non-null  float64
 10  extra                  68211 non-null  float64
 11  mta_tax                68211 non-null  float64
 12  tip_amount             68211 non-null  float64
 13  tolls_amount           68211 non-null  float64
 14  ehail_fee              0 non-null      float64
 15  improvement_surcharge  68211 non-null  float64
 16  total_amount           68211 non-null  float64
 17  payment_type_id        63887 non-null  float64
 18  trip_type_id           63877 non-null  float64
 19  congestion_surcharge   63887 non-null  float64
dtypes: float64(14), int64(3), object(3)
memory usage: 10.4+ MB
Renaming columns in a dataset can improve readability and make it easier to understand and manipulate the data. This not only makes it easier to interpret and process the data, but also reduces the likelihood of mistakes, leading to more reliable results. Therefore, changing column names is a crucial part of data analysis.
# Data type conversion
categoricals = ["vendor_id", "store_and_fwd_flag", "rate_code_id", "payment_type_id", "trip_type_id", "pickup_location_id", "dropoff_location_id"]
datetimes = ["pickup_datetime", "dropoff_datetime"]
for column in categoricals:
    trips[column] = trips[column].astype("category")
for column in datetimes:
    trips[column] = pd.to_datetime(trips[column])
trips.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68211 entries, 0 to 68210
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   vendor_id              68211 non-null  category
 1   pickup_datetime        68211 non-null  datetime64[ns]
 2   dropoff_datetime       68211 non-null  datetime64[ns]
 3   store_and_fwd_flag     63887 non-null  category
 4   rate_code_id           63887 non-null  category
 5   pickup_location_id     68211 non-null  category
 6   dropoff_location_id    68211 non-null  category
 7   passenger_count        63887 non-null  float64
 8   trip_distance          68211 non-null  float64
 9   fare_amount            68211 non-null  float64
 10  extra                  68211 non-null  float64
 11  mta_tax                68211 non-null  float64
 12  tip_amount             68211 non-null  float64
 13  tolls_amount           68211 non-null  float64
 14  ehail_fee              0 non-null      float64
 15  improvement_surcharge  68211 non-null  float64
 16  total_amount           68211 non-null  float64
 17  payment_type_id        63887 non-null  category
 18  trip_type_id           63877 non-null  category
 19  congestion_surcharge   63887 non-null  float64
dtypes: category(7), datetime64[ns](2), float64(11)
memory usage: 7.4 MB
Using the right data types makes computation more efficient, because operations on appropriate types run faster. Converting data types also reduces memory consumption, since smaller types use less RAM: here the reported memory usage dropped from 10.4+ MB to 7.4 MB. This optimization matters most with larger datasets, where insufficient memory can hinder performance. Appropriate data types therefore speed up processing and analysis and shorten the time needed to reach insights.
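The memory effect can be checked directly with `Series.memory_usage`. The sketch below compares an int64 column with its categorical version on made-up data; the column contents and the row count are assumptions chosen only to make the difference visible.

```python
import numpy as np
import pandas as pd

# Made-up ID column with only two distinct values, similar in spirit to vendor_id.
n_rows = 10_000
ids = pd.Series(np.tile([1, 2], n_rows // 2), dtype="int64")
as_category = ids.astype("category")

# Bytes used by the values alone (the index is excluded from both measurements).
int_bytes = ids.memory_usage(index=False, deep=True)
cat_bytes = as_category.memory_usage(index=False, deep=True)

print(int_bytes, cat_bytes)
```

The categorical version stores one small integer code per row plus the list of distinct values, so the saving is largest for columns with few unique values and shrinks as the number of categories grows.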
Cleaning¶
Data cleaning involves the handling of missing values, the removal of duplicates, and the correction of errors or inconsistencies. This process might also involve the smoothing of “noisy” data, that is, the application of techniques that reduce the impact of random variations or other issues in the data. When cleaning, it is important to avoid unnecessary data loss or overcleaning, which can remove valuable information or distort the data.
# Handle missing values
# MCAR: ehail_fee is empty in every row, so the column is dropped
trips = trips.drop(columns=["ehail_fee"])
# MNAR: fill trip_type_id with its mode
trips["trip_type_id"] = trips["trip_type_id"].fillna(trips["trip_type_id"].mode()[0])
# MAR (numeric): fill with the median
numericals = ["passenger_count", "congestion_surcharge"]
for column in numericals:
    trips[column] = trips[column].fillna(trips[column].median())
# MAR (categorical): fill with the mode
categoricals = ["store_and_fwd_flag", "rate_code_id", "payment_type_id"]
for column in categoricals:
    trips[column] = trips[column].fillna(trips[column].mode()[0])
trips.isnull().sum()
vendor_id                0
pickup_datetime          0
dropoff_datetime         0
store_and_fwd_flag       0
rate_code_id             0
pickup_location_id       0
dropoff_location_id      0
passenger_count          0
trip_distance            0
fare_amount              0
extra                    0
mta_tax                  0
tip_amount               0
tolls_amount             0
improvement_surcharge    0
total_amount             0
payment_type_id          0
trip_type_id             0
congestion_surcharge     0
dtype: int64
MAR and MNAR columns usually call for more complex imputation techniques, while MCAR columns can be removed without much impact on the analysis. Here, a simpler central tendency approach is used for the MAR and MNAR columns. For numeric data, missing values can be filled with the mean or median of the column, depending on the distribution of the data; the median is preferred if there are outliers. For categorical columns, missing values can be filled with the mode, which is the most frequently occurring value in the column. This approach helps maintain data consistency and limits the bias that may arise from missing values (Schafer & Graham, 2002).
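The preference for the median when outliers are present can be seen on a tiny made-up series. In the sketch below, the values and variable names are illustrative assumptions; a single large value pulls the mean well above the typical value, while the median stays put.

```python
import numpy as np
import pandas as pd

# Made-up passenger counts with one extreme value and one gap.
passengers = pd.Series([1.0, 1.0, 2.0, 1.0, 9.0, np.nan])

mean_fill = passengers.fillna(passengers.mean())      # gap becomes 2.8
median_fill = passengers.fillna(passengers.median())  # gap becomes 1.0

# Categorical-style column: fill the gap with the most frequent value.
payment = pd.Series([1.0, 2.0, 1.0, np.nan])
mode_fill = payment.fillna(payment.mode()[0])          # gap becomes 1.0

print(mean_fill.iloc[-1], median_fill.iloc[-1], mode_fill.iloc[-1])
```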
# Convert float-coded IDs to integers, then back to categories
# (passenger_count is listed but skipped below, so it stays float64)
integers = ["rate_code_id", "passenger_count", "payment_type_id", "trip_type_id"]
for column in integers:
    if column != "passenger_count":
        trips[column] = trips[column].astype("int64").astype("category")
trips.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68211 entries, 0 to 68210
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   vendor_id              68211 non-null  category
 1   pickup_datetime        68211 non-null  datetime64[ns]
 2   dropoff_datetime       68211 non-null  datetime64[ns]
 3   store_and_fwd_flag     68211 non-null  category
 4   rate_code_id           68211 non-null  category
 5   pickup_location_id     68211 non-null  category
 6   dropoff_location_id    68211 non-null  category
 7   passenger_count        68211 non-null  float64
 8   trip_distance          68211 non-null  float64
 9   fare_amount            68211 non-null  float64
 10  extra                  68211 non-null  float64
 11  mta_tax                68211 non-null  float64
 12  tip_amount             68211 non-null  float64
 13  tolls_amount           68211 non-null  float64
 14  improvement_surcharge  68211 non-null  float64
 15  total_amount           68211 non-null  float64
 16  payment_type_id        68211 non-null  category
 17  trip_type_id           68211 non-null  category
 18  congestion_surcharge   68211 non-null  float64
dtypes: category(7), datetime64[ns](2), float64(10)
memory usage: 6.9 MB
It is very important to change some columns from float to integer and then back to categorical data types for a number of reasons. First, it helps keep the integrity of the data by making sure that data types accurately describe the type of data, which is important for accurate analysis. Also, using categorical data types instead of integer or float types saves more memory, especially when working with columns that only have a few unique values. This efficiency can lead to improved performance during data processing and analysis. Putting data into groups can also make results easier to understand by making it easier to find patterns and trends in the data. Overall, these conversions help make the analytical process more efficient and effective (Little & Rubin, 2019).
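A cast from float to integer silently truncates fractional values and fails on missing ones, so it can be worth guarding. The helper below is a hypothetical sketch (the name `to_integer_category` is not from the notebook) that assumes a plain numeric float Series as input and refuses to convert anything that is not a whole number.

```python
import pandas as pd


def to_integer_category(series: pd.Series) -> pd.Series:
    """Cast a float Series of whole numbers to int64, then to category."""
    if series.isna().any():
        raise ValueError("impute or drop missing values before casting")
    if not (series % 1 == 0).all():
        raise ValueError("column contains non-integer values")
    return series.astype("int64").astype("category")


# Made-up rate codes stored as floats.
rate_codes = pd.Series([1.0, 1.0, 5.0, 2.0])
converted = to_integer_category(rate_codes)

print(converted.dtype, list(converted.cat.categories))
```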
# Outlier removal for numerical columns using the IQR method
# Columns filtered against the lower bound only
upper_bound_exclusions = ["tolls_amount", "improvement_surcharge", "congestion_surcharge"]
# Columns skipped entirely
all_exclusions = ["passenger_count"]
numericals = [column for column in trips.columns if trips[column].dtype in ["float64", "int64"] and column not in all_exclusions]
for column in numericals:
    # Quartiles are recomputed on the rows that survived the previous columns
    q1 = trips[column].quantile(0.25)
    q3 = trips[column].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    if column not in upper_bound_exclusions:
        trips = trips[(trips[column] >= lower_bound) & (trips[column] <= upper_bound)]
    else:
        trips = trips[(trips[column] >= lower_bound)]
print(trips.shape)
import matplotlib.pyplot as plt
# Re-check outliers after filtering, including passenger_count for reference
columns_to_filter = numericals.copy()
columns_to_filter = columns_to_filter + ["passenger_count"]
# Create boxplots for each column in separate subplots (2 x 5 grid)
rows = 2
cols = 5
# Create a figure with specified size
fig, axs = plt.subplots(rows, cols, figsize=(18, 12))
# Flatten the axes array for easy iteration
axs = axs.flatten()
# Draw one boxplot per column
for i, column in enumerate(columns_to_filter):
    trips.boxplot(column=column, ax=axs[i])
# Adjust layout
plt.tight_layout()
# Display the plot
plt.show()
(43484, 19)
The aim here is to remove outliers from the numerical columns so that the analysis is not distorted by extreme values. For data that is not normally distributed, the IQR method is a useful choice because its limits are built from the quartiles, which extreme values barely affect: a value is treated as an outlier when it lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 - Q1 is the spread of the central 50% of the data (Tukey, 1977). For normally distributed data, Z-scores are a common alternative. In this case, tolls_amount, improvement_surcharge, and congestion_surcharge are exempt from the upper bound and are filtered only against the lower bound, with the aim of eliminating negative values; all other columns are filtered against both bounds. The passenger_count column is excluded from the outlier treatment because its values are meaningful and relevant to the analysis. As a result, the dataset shrank by about 36% (from 68,211 to 43,484 rows).
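The IQR limits can be worked through on a small made-up series. The function name `iqr_bounds` and the sample values below are assumptions for illustration; the arithmetic follows the same quartile-based rule used in the filtering step.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) limits q1 - k*iqr and q3 + k*iqr."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up fares with one extreme value.
fares = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype="float64")

# q1 = 3.25, q3 = 7.75, iqr = 4.5, so the limits are -3.5 and 14.5.
lower, upper = iqr_bounds(fares)
kept = fares[(fares >= lower) & (fares <= upper)]

print(lower, upper, len(kept))
```

One caveat worth keeping in mind: when a column is dominated by a single value, Q1 and Q3 coincide, the IQR is zero, and both limits collapse onto that value, so every other value is removed. Flat fees and surcharges may behave this way, which can account for a large share of dropped rows.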
Enriching¶
Data enrichment, also known as data augmentation, adds new information to existing datasets to make them more useful. The work starts by evaluating the available data to determine what additional information would improve the analysis. External databases, public records, or contracted services can provide such information. Once the relevant data is identified, it must be integrated with the existing dataset carefully so that consistency and accuracy are preserved. Done well, data enrichment improves the quality of insights derived from the data and supports decision-making and planning in the organization. Businesses can be more competitive and run more smoothly when they use enriched data.
# Extract hour, day of week, and month from pickup datetime
trips["pickup_hour"] = trips["pickup_datetime"].dt.hour
trips["pickup_day_of_week"] = trips["pickup_datetime"].dt.day_name()
trips["pickup_month"] = trips["pickup_datetime"].dt.month_name()
trips["trip_duration_minutes"] = (trips["dropoff_datetime"] - trips["pickup_datetime"]).dt.total_seconds() / 60
trips["trip_duration_minutes"] = round(trips["trip_duration_minutes"], 2)
trips["trip_duration_hours"] = round(trips["trip_duration_minutes"] / 60, 2)
trips.head()
| | vendor_id | pickup_datetime | dropoff_datetime | store_and_fwd_flag | rate_code_id | pickup_location_id | dropoff_location_id | passenger_count | trip_distance | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | payment_type_id | trip_type_id | congestion_surcharge | pickup_hour | pickup_day_of_week | pickup_month | trip_duration_minutes | trip_duration_hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2023-01-01 00:26:10 | 2023-01-01 00:37:11 | N | 1 | 166 | 143 | 1.00 | 2.58 | 14.90 | 1.00 | 0.50 | 4.03 | 0.00 | 1.00 | 24.18 | 1 | 1 | 2.75 | 0 | Sunday | January | 11.02 | 0.18 |
| 1 | 2 | 2023-01-01 00:51:03 | 2023-01-01 00:57:49 | N | 1 | 24 | 43 | 1.00 | 1.81 | 10.70 | 1.00 | 0.50 | 2.64 | 0.00 | 1.00 | 15.84 | 1 | 1 | 0.00 | 0 | Sunday | January | 6.77 | 0.11 |
| 2 | 2 | 2023-01-01 00:35:12 | 2023-01-01 00:41:32 | N | 1 | 223 | 179 | 1.00 | 0.00 | 7.20 | 1.00 | 0.50 | 1.94 | 0.00 | 1.00 | 11.64 | 1 | 1 | 0.00 | 0 | Sunday | January | 6.33 | 0.11 |
| 5 | 2 | 2023-01-01 00:53:31 | 2023-01-01 01:11:04 | N | 1 | 41 | 262 | 1.00 | 2.78 | 17.70 | 1.00 | 0.50 | 0.00 | 0.00 | 1.00 | 22.95 | 2 | 1 | 2.75 | 0 | Sunday | January | 17.55 | 0.29 |
| 7 | 2 | 2023-01-01 00:11:58 | 2023-01-01 00:24:55 | N | 1 | 24 | 75 | 1.00 | 1.88 | 14.20 | 1.00 | 0.50 | 0.00 | 0.00 | 1.00 | 16.70 | 2 | 1 | 0.00 | 0 | Sunday | January | 12.95 | 0.22 |
This step enriches the trips dataset with time-based features. Extracting the pickup hour, weekday, and month makes usage patterns easier to read: peak hours and day-of-week effects can be identified directly, and the month column would support seasonal comparisons if more than one month of data were loaded. Knowing when demand peaks helps with service delivery and operational planning. Expressing trip duration in minutes and hours likewise makes travel times easier to interpret and compare. Together, these features give a fuller picture of taxi operations and customer behavior.
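As a minimal sketch of how the new time features can be used, the snippet below groups a handful of made-up trips by pickup hour. The toy rows and the variable names `toy` and `by_hour` are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

# Four made-up trips: two in the morning, two in the evening
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2023-01-02 08:15:00", "2023-01-02 08:40:00",
        "2023-01-02 18:05:00", "2023-01-03 18:30:00",
    ]),
    "tip_amount": [1.0, 3.0, 2.0, 4.0],
})

# Same feature extraction as in the cell above
toy["pickup_hour"] = toy["pickup_datetime"].dt.hour
toy["pickup_day_of_week"] = toy["pickup_datetime"].dt.day_name()

# Number of trips and average tip per pickup hour
by_hour = toy.groupby("pickup_hour")["tip_amount"].agg(["count", "mean"])
print(by_hour)
```

The same pattern applies to `pickup_day_of_week` for day-level comparisons.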
# Mapping Id to Name based on Data Dictionary
# Create vendor_id to vendor_name
vendor_mapping = {
1: "Creative Mobile Technologies, LLC.",
2: "VeriFone Inc."
}
# Create rate_code_id to rate_code_name
rate_code_mapping = {
1: "standard Rate",
2: "JFK",
3: "Newark",
4: "Nassau or Westchester",
5: "Negotiated Fare",
6: "Group Ride",
99: "Unknown"
}
# Create store_and_fwd_flag to store_and_fwd_flag_name
store_and_fwd_mapping = {
"Y": "store and Forward Trip",
"N": "Not A Store and Forward Trip",
}
# Create payment_type to payment_type_name
payment_type_mapping = {
1: "Credit Card",
2: "Cash",
3: "No Charge",
4: "Dispute",
5: "Unknown",
6: "Voided Trip"
}
# Create trip_type to trip_type_name
trip_type_mapping = {
1: "street-hail",
2: "Dispatch"
}
# Create new columns based on dictionaries
trips["vendor_name"] = trips["vendor_id"].map(vendor_mapping).astype("category")
trips["rate_code_name"] = trips["rate_code_id"].map(rate_code_mapping).astype("category")
trips["store_and_fwd_flag_name"] = trips["store_and_fwd_flag"].map(store_and_fwd_mapping).astype("category")
trips["payment_type_name"] = trips["payment_type_id"].map(payment_type_mapping).astype("category")
trips["trip_type_name"] = trips["trip_type_id"].map(trip_type_mapping).astype("category")
trips.head()
| | vendor_id | pickup_datetime | dropoff_datetime | store_and_fwd_flag | rate_code_id | pickup_location_id | dropoff_location_id | passenger_count | trip_distance | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | payment_type_id | trip_type_id | congestion_surcharge | pickup_hour | pickup_day_of_week | pickup_month | trip_duration_minutes | trip_duration_hours | vendor_name | rate_code_name | store_and_fwd_flag_name | payment_type_name | trip_type_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2023-01-01 00:26:10 | 2023-01-01 00:37:11 | N | 1 | 166 | 143 | 1.00 | 2.58 | 14.90 | 1.00 | 0.50 | 4.03 | 0.00 | 1.00 | 24.18 | 1 | 1 | 2.75 | 0 | Sunday | January | 11.02 | 0.18 | VeriFone Inc. | standard Rate | Not A Store and Forward Trip | Credit Card | street-hail |
| 1 | 2 | 2023-01-01 00:51:03 | 2023-01-01 00:57:49 | N | 1 | 24 | 43 | 1.00 | 1.81 | 10.70 | 1.00 | 0.50 | 2.64 | 0.00 | 1.00 | 15.84 | 1 | 1 | 0.00 | 0 | Sunday | January | 6.77 | 0.11 | VeriFone Inc. | standard Rate | Not A Store and Forward Trip | Credit Card | street-hail |
| 2 | 2 | 2023-01-01 00:35:12 | 2023-01-01 00:41:32 | N | 1 | 223 | 179 | 1.00 | 0.00 | 7.20 | 1.00 | 0.50 | 1.94 | 0.00 | 1.00 | 11.64 | 1 | 1 | 0.00 | 0 | Sunday | January | 6.33 | 0.11 | VeriFone Inc. | standard Rate | Not A Store and Forward Trip | Credit Card | street-hail |
| 5 | 2 | 2023-01-01 00:53:31 | 2023-01-01 01:11:04 | N | 1 | 41 | 262 | 1.00 | 2.78 | 17.70 | 1.00 | 0.50 | 0.00 | 0.00 | 1.00 | 22.95 | 2 | 1 | 2.75 | 0 | Sunday | January | 17.55 | 0.29 | VeriFone Inc. | standard Rate | Not A Store and Forward Trip | Cash | street-hail |
| 7 | 2 | 2023-01-01 00:11:58 | 2023-01-01 00:24:55 | N | 1 | 24 | 75 | 1.00 | 1.88 | 14.20 | 1.00 | 0.50 | 0.00 | 0.00 | 1.00 | 16.70 | 2 | 1 | 0.00 | 0 | Sunday | January | 12.95 | 0.22 | VeriFone Inc. | standard Rate | Not A Store and Forward Trip | Cash | street-hail |
This step uses the data dictionary to map identification numbers to descriptive names, which makes the dataset easier to read. Mappings are applied to vendor IDs, rate codes, the store-and-forward flag, payment types, and trip types. For example, each vendor ID is replaced by the name of the corresponding company, and each rate code by a label describing the fare type. Analysts and report readers no longer need to consult the data dictionary to understand what a code represents, which makes analysis and reporting more direct.
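One detail worth checking with dictionary-based mapping is that `Series.map` returns NaN for any ID that is missing from the dictionary. The sketch below shows one way to detect such IDs before converting to a categorical type; the shortened mapping and the code 7 are made up for illustration.

```python
import pandas as pd

# Shortened mapping for illustration; 7 is a made-up code with no entry
payment_type_mapping = {1: "Credit Card", 2: "Cash"}
toy = pd.DataFrame({"payment_type_id": [1, 2, 1, 7]})

# IDs without a dictionary entry become NaN after mapping
toy["payment_type_name"] = toy["payment_type_id"].map(payment_type_mapping)
unmapped = toy.loc[toy["payment_type_name"].isna(), "payment_type_id"].unique()
print(unmapped)

# Label the gaps explicitly before converting to a categorical column
toy["payment_type_name"] = toy["payment_type_name"].fillna("Unknown").astype("category")
```

If `unmapped` is empty, the dictionary covers every code present in the data.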
real_zones = pd.read_csv("taxi_zones/taxi_zone_lookup.csv")
# Copy the DataFrame
zones = real_zones.copy()
zones.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 265 entries, 0 to 264
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   LocationID    265 non-null    int64
 1   Borough       264 non-null    object
 2   Zone          264 non-null    object
 3   service_zone  263 non-null    object
dtypes: int64(1), object(3)
memory usage: 8.4+ KB
This step imports external data from a CSV file containing the taxi zone lookup table, which adds a spatial dimension to the existing dataset. Loading it into a DataFrame makes it possible to relate taxi trips to specific areas. Working on a copy keeps the original lookup table unchanged while the copy is cleaned and modified. The info() summary shows 265 zones and four columns, with a few missing values in Borough, Zone, and service_zone. Resolving these gaps comes first, before trips are linked to zones for geographic analysis.
zones.duplicated().sum()
0
This step checks the taxi zone DataFrame for duplicate rows. Duplicate detection is a basic part of data quality control, because repeated rows would distort any lookup or aggregation built on the table. The result is 0, so every row in the lookup table is unique and no rows need to be removed.
columns_to_check = [column for column in zones.columns if zones[column].dtype=="object"]
for column in columns_to_check:
    display(zones[zones[column].isnull()])
| | LocationID | Borough | Zone | service_zone |
|---|---|---|---|---|
| 264 | 265 | NaN | Outside of NYC | NaN |
| | LocationID | Borough | Zone | service_zone |
|---|---|---|---|---|
| 263 | 264 | Unknown | NaN | NaN |
| | LocationID | Borough | Zone | service_zone |
|---|---|---|---|---|
| 263 | 264 | Unknown | NaN | NaN |
| 264 | 265 | NaN | Outside of NYC | NaN |
This step finds and displays the missing entries in the categorical columns of the taxi zone DataFrame. First, all columns with the object data type are selected so that only the categorical columns are examined. The loop then checks each of these columns for null values and displays the rows that contain them. The gaps are confined to two rows, LocationID 264 and 265: Borough is missing for 265, Zone is missing for 264, and service_zone is missing for both. Identifying these gaps matters for data quality because they must be handled before the table is used in later steps.
zones = zones.fillna("Unknown")
zones.isnull().sum()
LocationID      0
Borough         0
Zone            0
service_zone    0
dtype: int64
When values are missing, it can be better to label them explicitly as unknown than to impute them. For numerical data, imputing a mean or median shifts values above or below what they should be; for categorical data such as these zone attributes, central measures like the mean do not apply, so a placeholder such as "Unknown" is the natural choice. This keeps the existing categories intact and states plainly that the information is absent instead of hiding it. The approach works best when the absence of data should remain visible (Little & Rubin, 2019).
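A slightly more defensive variant of this idea restricts the placeholder to the text columns, so that a numeric column could never receive the string "Unknown" by accident. The sketch below uses a three-row toy table; the rows for LocationID 264 and 265 mirror the gaps shown above, and the names `zones_toy` and `text_columns` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy lookup table with the same kind of gaps as the real one
zones_toy = pd.DataFrame({
    "LocationID": [1, 264, 265],
    "Borough": ["EWR", "Unknown", np.nan],
    "Zone": ["Newark Airport", np.nan, "Outside of NYC"],
    "service_zone": ["EWR", np.nan, np.nan],
})

# Fill only the categorical (text) columns with an explicit placeholder
text_columns = ["Borough", "Zone", "service_zone"]
zones_toy[text_columns] = zones_toy[text_columns].fillna("Unknown")

print(zones_toy.isnull().sum())
```

In the notebook the lookup table has no gaps in LocationID, so filling the whole DataFrame gives the same result there.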
borough_mapping = dict(zip(zones["LocationID"], zones["Borough"]))
zone_mapping = dict(zip(zones["LocationID"], zones["Zone"]))
service_zone_mapping = dict(zip(zones["LocationID"], zones["service_zone"]))
trips["pickup_borough"] = trips["pickup_location_id"].map(borough_mapping).astype("category")
trips["pickup_zone"] = trips["pickup_location_id"].map(zone_mapping).astype("category")
trips["pickup_service_zone"] = trips["pickup_location_id"].map(service_zone_mapping).astype("category")
trips["dropoff_borough"] = trips["dropoff_location_id"].map(borough_mapping).astype("category")
trips["dropoff_zone"] = trips["dropoff_location_id"].map(zone_mapping).astype("category")
trips["dropoff_service_zone"] = trips["dropoff_location_id"].map(service_zone_mapping).astype("category")
trips_isnull_sum = trips.isnull().sum().reset_index().rename(columns={"index":"Column Name", 0:"Missing Value Counts"})
trips_isnull_sum = trips_isnull_sum[trips_isnull_sum["Missing Value Counts"]!=0]
trips_isnull_sum
| Column Name | Missing Value Counts |
|---|---|
The trips DataFrame is enriched with the taxi zone lookup table, which maps each location ID to a borough, zone, and service zone. Using these mappings, the code creates six new columns that describe the pickup and dropoff locations. The empty table above confirms that no column contains missing values after the mapping, so every location ID in the trips data was found in the lookup table. This spatial context allows trips to be analyzed by area and shows which parts of the city are served most, which in turn deepens the understanding of usage patterns.
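The same enrichment could also be written as a left join, which is a different technique from the dictionary mapping used in the notebook. The sketch below is an alternative under stated assumptions: the toy tables and the location ID 999 are made up, and `validate` and `indicator` are used only to surface duplicate keys and unmatched IDs.

```python
import pandas as pd

# Toy lookup table and toy trips; 999 is a made-up ID absent from the lookup
zones_toy = pd.DataFrame({
    "LocationID": [74, 75],
    "Borough": ["Manhattan", "Manhattan"],
    "Zone": ["East Harlem North", "East Harlem South"],
})
trips_toy = pd.DataFrame({"pickup_location_id": [74, 75, 74, 999]})

# Left join: every trip is kept, zone attributes are attached where available
enriched = trips_toy.merge(
    zones_toy.add_prefix("pickup_"),
    how="left",
    left_on="pickup_location_id",
    right_on="pickup_LocationID",
    validate="many_to_one",  # raises if a LocationID appears twice in the lookup
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

unmatched = enriched.loc[enriched["_merge"] == "left_only", "pickup_location_id"]
print(unmatched.tolist())
```

The dictionary approach is shorter for a handful of columns, while the join scales more naturally when many attributes are attached at once.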
import geopandas as gpd
# Load the shapefile
gdf = gpd.read_file("taxi_zones/taxi_zones.shp")
# Save as GeoJSON
gdf.to_file("taxi_zones/taxi_zones.geojson", driver="GeoJSON")
Converting the Shapefile to GeoJSON improves compatibility in geospatial data analysis. A Shapefile is spread across several companion files and has known limitations, including weak Unicode support, whereas GeoJSON is a single text-based file that is widely supported, which makes it convenient for web-based applications and exploratory data analysis (Häme, 2019).
Validating¶
This phase aims to confirm that the data is accurate and complete. It begins with checks against domain rules, the relationships among data elements, and any remaining gaps in the dataset. Further validation procedures include type verification, value and format restrictions, consistency checks, uniqueness verification, field and cross-field checks, and statistical checks. Once the final validation steps are complete, the data is published or made ready for use in downstream applications. These checks improve the quality and trustworthiness of the data and prepare it for deeper analysis, which supports better decision-making.
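A few of the checks listed above can be expressed as simple boolean rules. The following sketch is not the notebook's validation code; the rule names are hypothetical, the two rows are copied from the preview table, and the 1 to 6 range for passenger_count is an assumption taken from the values observed in this dataset.

```python
import pandas as pd

# Two rows copied from the preview of the trips DataFrame
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2023-01-01 00:26:10", "2023-01-01 00:51:03"]),
    "dropoff_datetime": pd.to_datetime(["2023-01-01 00:37:11", "2023-01-01 00:57:49"]),
    "fare_amount": [14.90, 10.70],
    "tip_amount": [4.03, 2.64],
    "passenger_count": [1.0, 1.0],
})

# Each rule evaluates to True when the data passes the check
checks = {
    "no_missing_values": not toy.isnull().any().any(),
    "dropoff_not_before_pickup": (toy["dropoff_datetime"] >= toy["pickup_datetime"]).all(),
    "non_negative_amounts": (toy[["fare_amount", "tip_amount"]] >= 0).all().all(),
    "passenger_count_in_range": toy["passenger_count"].between(1, 6).all(),
    "no_duplicate_rows": not toy.duplicated().any(),
}

failed = [name for name, ok in checks.items() if not ok]
print(failed)
```

An empty list means every rule passed; any name in the list points to the check that needs attention.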
# Describing the cleaned data
trips.describe()
| | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | pickup_hour | trip_duration_minutes | trip_duration_hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 43484 | 43484 | 43484.00 | 43484.00 | 43484.00 | 43484.00 | 43484.00 | 43484.00 | 43484.00 | 43484.00 | 43484.00 | 43484.00 | 43484.00 | 43484.00 | 43484.00 |
| mean | 2023-01-16 22:31:53.118687232 | 2023-01-16 22:47:33.103670016 | 1.36 | 1.93 | 12.90 | 0.75 | 0.50 | 1.70 | 0.02 | 1.00 | 17.55 | 0.68 | 14.12 | 15.67 | 0.26 |
| min | 2023-01-01 00:01:31 | 2023-01-01 00:16:02 | 1.00 | 0.00 | 0.00 | 0.00 | 0.50 | 0.00 | 0.00 | 1.00 | 1.50 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 2023-01-09 12:15:41.249999872 | 2023-01-09 12:31:58.750000128 | 1.00 | 1.09 | 8.60 | 0.00 | 0.50 | 0.00 | 0.00 | 1.00 | 12.20 | 0.00 | 11.00 | 6.77 | 0.11 |
| 50% | 2023-01-17 09:13:36 | 2023-01-17 09:24:28.500000 | 1.00 | 1.66 | 12.10 | 0.00 | 0.50 | 1.46 | 0.00 | 1.00 | 16.32 | 0.00 | 15.00 | 10.17 | 0.17 |
| 75% | 2023-01-24 15:51:52.500000 | 2023-01-24 16:05:54.750000128 | 1.00 | 2.52 | 16.30 | 1.00 | 0.50 | 3.00 | 0.00 | 1.00 | 21.95 | 0.00 | 18.00 | 14.45 | 0.24 |
| max | 2023-01-31 23:58:23 | 2023-02-01 17:27:05 | 6.00 | 6.35 | 31.00 | 2.50 | 0.50 | 7.64 | 11.75 | 1.00 | 37.20 | 2.75 | 23.00 | 1438.93 | 23.98 |
| std | NaN | NaN | 1.06 | 1.20 | 5.60 | 1.01 | 0.00 | 1.82 | 0.34 | 0.00 | 6.94 | 1.19 | 5.19 | 76.73 | 1.28 |
pickup_datetime: The dataset contains 43,484 taxi trips, with pickups between January 1 and January 31, 2023. The mean timestamp falls on January 16, 2023, at 22:31, close to the midpoint of the month, which is consistent with trips being spread across the whole month. A mean timestamp says little about the time of day; that is better read from pickup_hour, whose mean is 14.12 and median is 15, pointing to the afternoon as the center of activity.
dropoff_datetime: The mean dropoff time, January 16 at 22:47, is about 15 to 16 minutes after the mean pickup time, which matches the mean trip duration and shows that trips are typically short. The latest dropoff is on February 1, 2023, at 17:27, more than 17 hours after the last pickup, which implies at least one implausibly long trip.
passenger_count: Passenger counts range from 1 to 6. The median and the 75th percentile are both 1, so most trips carry a single passenger and groups are the exception. This pattern may inform fleet planning, since most demand can be met by standard-size vehicles.
trip_distance: The mean trip distance is 1.93 miles, which shows that most rides are short. After outlier removal, recorded distances range from 0 to 6.35 miles, so the remaining data covers short and moderate journeys.
fare_amount: The average fare is about $12.90, and fares range from $0 to $31 after outlier removal. Most fares are moderate, with half of all trips costing between $8.60 and $16.30.
extra: The average extra charge is $0.75, with values between $0 and $2.50. The median is $0, so at least half of the trips carry no extra charge; where it applies, it adds only a small amount to the fare.
mta_tax: Every trip carries an MTA tax of $0.50 (the minimum and maximum are both $0.50 and the standard deviation is 0). It is a flat fee, so it adds the same amount to every fare.
tip_amount: The average tip is about $1.70 and the median is $1.46. The 25th percentile is $0, so at least a quarter of trips record no tip at all. The largest recorded tip is $7.64.
tolls_amount: Most trips involve no toll: the 75th percentile is $0 and the mean is only $0.02. A small number of trips pay tolls of up to $11.75, which can raise the total cost of those trips noticeably.
improvement_surcharge: The improvement surcharge is a fixed $1.00 on every trip. Like the MTA tax, it is a flat fee and does not vary between trips.
congestion_surcharge: The congestion surcharge does not apply to most trips: the 75th percentile is $0, while the mean is $0.68 and the maximum is $2.75. Where it is charged, it adds a noticeable amount to a short trip.
total_amount: The average total charge, including the fare and all additional costs, is about $17.55. Totals range from $1.50 to $37.20, with half of all trips between $12.20 and $21.95.
trip_duration: The average trip lasts 15.67 minutes, but the median is only 10.17 minutes. The gap is caused by extreme values: the maximum is 1,438.93 minutes (almost 24 hours) and the standard deviation is 76.73 minutes. Durations of this length probably reflect recording problems more than real journeys, so this column needs further outlier treatment.
columns_to_remove = ["mta_tax", "improvement_surcharge"]
trips = trips.drop(columns=columns_to_remove)
The mta_tax and improvement_surcharge columns were removed because they are constant: every trip has the same value ($0.50 and $1.00 respectively). A column that never varies adds no information to an analysis; it cannot explain differences in tipping or any other outcome. Removing such columns makes the dataset smaller and keeps the focus on the features that do vary, such as fares and trip characteristics.
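Constant columns can also be found programmatically instead of being read off the describe() table. The sketch below is one way to do it on made-up rows; the variable name `constant_columns` is hypothetical.

```python
import pandas as pd

# Made-up rows: only fare_amount varies
toy = pd.DataFrame({
    "fare_amount": [14.90, 10.70, 7.20],
    "mta_tax": [0.50, 0.50, 0.50],
    "improvement_surcharge": [1.00, 1.00, 1.00],
})

# A column is constant when it holds exactly one distinct value
constant_columns = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
toy = toy.drop(columns=constant_columns)

print(constant_columns)
print(list(toy.columns))
```

Passing `dropna=False` counts missing values as a distinct value, so a column that mixes one value with NaN is not treated as constant.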
import seaborn as sns
plt.subplot(1, 2, 1)
sns.boxplot(trips["trip_duration_minutes"])
plt.subplot(1, 2, 2)
sns.boxplot(trips["trip_duration_hours"])
plt.tight_layout()
plt.show()
The boxplots for trip_duration_minutes and trip_duration_hours show multiple outliers, which means that some recorded trips are far longer than average. In the left plot the most extreme outliers exceed 1,400 minutes, and in the right plot they exceed 20 hours. Values like these bias summary statistics such as the mean, so they are removed to give a clearer view of typical trip durations. This step is necessary for the analysis to reflect realistic trip patterns.
q1 = trips["trip_duration_minutes"].quantile(0.25)
q3 = trips["trip_duration_minutes"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
trips = trips[(trips["trip_duration_minutes"] >= lower_bound) & (trips["trip_duration_minutes"] <= upper_bound)]
print(trips.shape)
# Re-draw the boxplots after outlier removal (seaborn is already imported above)
plt.subplot(1, 2, 1)
sns.boxplot(trips["trip_duration_minutes"])
plt.subplot(1, 2, 2)
sns.boxplot(trips["trip_duration_hours"])
plt.tight_layout()
plt.show()
(42408, 33)
After removing the outliers, 42,408 of the 43,484 trips remain, and the boxplots for trip_duration_minutes and trip_duration_hours give a more accurate picture of how long most taxi rides last. In the left boxplot, no trip exceeds the upper bound of roughly 26 minutes, so extreme values no longer distort the scale. Likewise, in the right boxplot all trips are now under about 0.43 hours. In short, removing the outliers makes the typical pattern of trip durations easier to see and analyze.
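The bounds used by the filter can be reproduced by hand from the quartiles in the describe() table. The figures below are approximate because the table rounds the quartiles to two decimals; the variable names follow the filtering cell.

```python
# Quartiles of trip_duration_minutes as shown (rounded) in the describe() table
q1, q3 = 6.77, 14.45

iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr   # negative, so it removes nothing in practice
upper_bound = q3 + 1.5 * iqr   # trips longer than this are dropped

# Row counts before and after the filter, taken from the printed shapes
rows_before, rows_after = 43484, 42408
removed = rows_before - rows_after

print(round(lower_bound, 2), round(upper_bound, 2))
print(removed, f"{removed / rows_before:.1%}")
```

Since the lower bound is below zero and durations start at 0, only the upper bound has any effect here.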
# Describing the cleaned data
trips.describe(include=["category","object"])
| | vendor_id | store_and_fwd_flag | rate_code_id | pickup_location_id | dropoff_location_id | payment_type_id | trip_type_id | pickup_day_of_week | pickup_month | vendor_name | rate_code_name | store_and_fwd_flag_name | payment_type_name | trip_type_name | pickup_borough | pickup_zone | pickup_service_zone | dropoff_borough | dropoff_zone | dropoff_service_zone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 42408 | 42408 | 42408 | 42408 | 42408 | 42408 | 42408 | 42408 | 42408 | 42408 | 42408 | 42408 | 42408 | 42408 | 42408 | 42408 | 42408 | 42408 | 42408 | 42408 |
| unique | 2 | 1 | 3 | 159 | 230 | 4 | 2 | 7 | 1 | 2 | 3 | 1 | 4 | 2 | 6 | 158 | 4 | 6 | 229 | 4 |
| top | 2 | N | 1 | 74 | 74 | 1 | 1 | Tuesday | January | VeriFone Inc. | standard Rate | Not A Store and Forward Trip | Credit Card | street-hail | Manhattan | East Harlem North | Boro Zone | Manhattan | East Harlem North | Boro Zone |
| freq | 42309 | 42408 | 42384 | 9302 | 2559 | 26414 | 42407 | 6959 | 42408 | 42309 | 42384 | 42408 | 26414 | 42407 | 26292 | 9302 | 39499 | 26146 | 2559 | 28829 |
vendor_id: There are two distinct vendors. The mode (the most frequent value) is 2, which occurs 42,309 times, so almost all trips come from a single vendor.
store_and_fwd_flag: This field has only one unique value, N, which occurs 42,408 times. None of the remaining trips was a store-and-forward trip.
rate_code_id: There are three unique rate codes, and the mode, 1, occurs 42,384 times. Nearly every trip is therefore charged at the standard rate.
pickup_location_id: A total of 159 unique pickup locations are recorded. The most frequent is location 74, with 9,302 trips, so pickups are spread over many areas but concentrated in a few.
dropoff_location_id: There are 230 unique dropoff locations. The most frequent is again location 74, but with only 2,559 trips, which shows that dropoffs are spread more widely than pickups.
payment_type_id: Four unique payment types are recorded. The mode is 1, which occurs 26,414 times.
trip_type_id: Two trip types are recorded, and the mode is 1, which occurs 42,407 times. All but one trip is therefore a street hail and not a dispatched trip.
pickup_day_of_week: All seven days of the week are present. The most frequent day is Tuesday, with 6,959 trips.
pickup_month: Only one month, January, is present, so the data covers a single month. This limits how far the findings can be generalized to other times of the year.
vendor_name: There are two unique vendor names, and the most frequent is VeriFone Inc., with 42,309 trips, matching vendor_id 2.
rate_code_name: Three unique rate code names are present. The mode is Standard Rate, with 42,384 trips.
store_and_fwd_flag_name: Only one value is present, Not A Store and Forward Trip, which occurs 42,408 times.
payment_type_name: Four unique payment type names are present. The most common is Credit Card, with 26,414 trips, or about 62% of the total.
trip_type_name: Two trip type names are present, and the mode is Street-hail, with 42,407 trips.
pickup_borough: Six unique borough values are present. The mode is Manhattan, with 26,292 trips.
pickup_zone: There are 158 unique pickup zones. The most common is East Harlem North, with 9,302 trips.
pickup_service_zone: Four unique service zones are present. The mode is Boro Zone, with 39,499 trips.
dropoff_borough: Six unique borough values are present for dropoffs as well. The mode is again Manhattan, with 26,146 trips.
dropoff_zone: There are 229 unique dropoff zones. The most common is East Harlem North, with 2,559 trips.
dropoff_service_zone: Four unique service zones are present. The mode is Boro Zone, with 28,829 trips.
columns_to_remove = ["store_and_fwd_flag", "pickup_month", "store_and_fwd_flag_name"]
trips = trips.drop(columns=columns_to_remove)
As with mta_tax and improvement_surcharge, the dataset was simplified by dropping store_and_fwd_flag, pickup_month, and store_and_fwd_flag_name. Each of these columns contains a single value across all trips, so it carries no information for the analysis, and removing it makes the dataset leaner to process.
real_trips.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68211 entries, 0 to 68210
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   VendorID               68211 non-null  int64
 1   lpep_pickup_datetime   68211 non-null  object
 2   lpep_dropoff_datetime  68211 non-null  object
 3   store_and_fwd_flag     63887 non-null  object
 4   RatecodeID             63887 non-null  float64
 5   PULocationID           68211 non-null  int64
 6   DOLocationID           68211 non-null  int64
 7   passenger_count        63887 non-null  float64
 8   trip_distance          68211 non-null  float64
 9   fare_amount            68211 non-null  float64
 10  extra                  68211 non-null  float64
 11  mta_tax                68211 non-null  float64
 12  tip_amount             68211 non-null  float64
 13  tolls_amount           68211 non-null  float64
 14  ehail_fee              0 non-null      float64
 15  improvement_surcharge  68211 non-null  float64
 16  total_amount           68211 non-null  float64
 17  payment_type           63887 non-null  float64
 18  trip_type              63877 non-null  float64
 19  congestion_surcharge   63887 non-null  float64
dtypes: float64(14), int64(3), object(3)
memory usage: 10.4+ MB
The summary describes the raw DataFrame, which has 68,211 entries and 20 columns, each holding a different attribute of a taxi trip. VendorID, PULocationID, and DOLocationID are stored as integers and contain IDs defined in the data dictionary. The pickup and dropoff timestamps, lpep_pickup_datetime and lpep_dropoff_datetime, are stored as object (text) and must be converted before any date arithmetic. Several columns have missing values: store_and_fwd_flag, RatecodeID, passenger_count, payment_type, and congestion_surcharge each have 63,887 non-null entries, trip_type has 63,877, and ehail_fee is empty in every row. The remaining columns, including trip_distance, fare_amount, and total_amount, are complete and stored as floats. The DataFrame occupies about 10.4 MB of memory.
trips.info()
<class 'pandas.core.frame.DataFrame'>
Index: 42408 entries, 0 to 68010
Data columns (total 30 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   vendor_id              42408 non-null  category
 1   pickup_datetime        42408 non-null  datetime64[ns]
 2   dropoff_datetime       42408 non-null  datetime64[ns]
 3   rate_code_id           42408 non-null  category
 4   pickup_location_id     42408 non-null  category
 5   dropoff_location_id    42408 non-null  category
 6   passenger_count        42408 non-null  float64
 7   trip_distance          42408 non-null  float64
 8   fare_amount            42408 non-null  float64
 9   extra                  42408 non-null  float64
 10  tip_amount             42408 non-null  float64
 11  tolls_amount           42408 non-null  float64
 12  total_amount           42408 non-null  float64
 13  payment_type_id        42408 non-null  category
 14  trip_type_id           42408 non-null  category
 15  congestion_surcharge   42408 non-null  float64
 16  pickup_hour            42408 non-null  int32
 17  pickup_day_of_week     42408 non-null  object
 18  trip_duration_minutes  42408 non-null  float64
 19  trip_duration_hours    42408 non-null  float64
 20  vendor_name            42408 non-null  category
 21  rate_code_name         42408 non-null  category
 22  payment_type_name      42408 non-null  category
 23  trip_type_name         42408 non-null  category
 24  pickup_borough         42408 non-null  category
 25  pickup_zone            42408 non-null  category
 26  pickup_service_zone    42408 non-null  category
 27  dropoff_borough        42408 non-null  category
 28  dropoff_zone           42408 non-null  category
 29  dropoff_service_zone   42408 non-null  category
dtypes: category(16), datetime64[ns](2), float64(10), int32(1), object(1)
memory usage: 5.5+ MB
After data wrangling, the DataFrame has 42,408 entries and 30 columns, and every column is fully populated. Sixteen columns, including vendor_id, rate_code_id, payment_type_id, trip_type_id, vendor_name, rate_code_name, payment_type_name, and trip_type_name, have been converted to the category dtype, which reduces memory use and makes grouping more convenient. The pickup_datetime and dropoff_datetime columns are now datetime64, which enables time-based calculations. Unlike the raw DataFrame, which had several columns with missing values, the cleaned DataFrame contains no nulls. New columns such as pickup_hour, pickup_day_of_week, trip_duration_minutes, and trip_duration_hours widen the scope of the exploration. The cleaned DataFrame occupies about 5.5 MB of memory, compared with about 10.4 MB for the raw data.
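The memory effect of the category conversion can be illustrated with a small, self-contained sketch. The data below is synthetic, and the column name payment_type_name is reused only for illustration; the actual savings depend on how many distinct values each column holds.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a low-cardinality text column (not the real TLC data).
rng = np.random.default_rng(0)
labels = np.array(["Cash", "Credit Card", "No Charge", "Dispute"])
as_object = pd.Series(rng.choice(labels, size=50_000), dtype=object, name="payment_type_name")

# deep=True also counts the Python string objects, not only the pointers to them.
bytes_as_object = as_object.memory_usage(deep=True)
bytes_as_category = as_object.astype("category").memory_usage(deep=True)

print(f"object:   {bytes_as_object / 1e6:.2f} MB")
print(f"category: {bytes_as_category / 1e6:.2f} MB")
```

A category column stores each distinct label once plus a small integer code per row, which is why it tends to be much smaller when a few labels repeat many times.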
# Exporting data
trips = trips.sort_values("pickup_datetime")
trips.to_csv("NYC TLC Trip Record Cleaned.csv", index=False, sep=";")
As the final step of data wrangling, the cleaned taxi trip data is saved to the file “NYC TLC Trip Record Cleaned.csv”, using a semicolon as the separator. Saving the result means later analyses can start from the cleaned file instead of repeating the wrangling steps. The rows are sorted chronologically by pickup time, which makes the file easy to navigate. The saved file supports further exploration of the trip attributes, and it can be loaded into Tableau to build ad hoc visualizations that are easier for an audience to interpret. This step completes the data preparation and sets up the analysis that follows.
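Because CSV stores plain text, the datetime and category dtypes are not preserved in the exported file. The sketch below shows one way to restore them when reading the data back, under the assumption that the same semicolon separator is used. It writes to an in-memory buffer and uses a made-up three-column table rather than the real file.

```python
import io

import pandas as pd

# Tiny stand-in for the cleaned trips table (values are made up).
demo = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2023-01-01 08:15:00", "2023-01-01 07:05:00"]),
    "payment_type_name": pd.Categorical(["Cash", "Credit Card"]),
    "tip_amount": [0.0, 2.5],
})

# Same export settings as the notebook, written to a buffer instead of disk.
buffer = io.StringIO()
demo.sort_values("pickup_datetime").to_csv(buffer, index=False, sep=";")
buffer.seek(0)

# Restore the dtypes explicitly while reading.
restored = pd.read_csv(
    buffer,
    sep=";",
    parse_dates=["pickup_datetime"],
    dtype={"payment_type_name": "category"},
)
print(restored.dtypes)
```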
Exploratory Data Analysis (EDA)¶
Exploratory Data Analysis (EDA) is a fundamental process in data science that involves examining datasets to uncover underlying patterns, trends, and anomalies without making prior assumptions about the data. It is akin to exploring a new environment, where data scientists meticulously inspect various aspects of the dataset, such as distributions, relationships, and potential outliers. EDA employs a combination of statistical techniques and visualization tools to summarize and visualize data. EDA helps in understanding the data structure, identifying missing values, and detecting outliers, which are crucial for building accurate models.
EDA is essential for ensuring data quality and integrity before proceeding to more complex analyses. It helps in generating accurate models by identifying and addressing issues in the data. EDA prevents the use of wrong data and inefficient resource use by ensuring proper data preparation. It aids in creating the right types of variables, which is critical for effective data preparation.
How do the tipping patterns vary by hour of the day and day of the week?¶
The data is aggregated by day of the week, and median tip amounts are calculated for each combination of day and hour. A heatmap is generated to visually represent these median tip amounts, with days of the week ordered from Monday to Sunday along the y-axis and hours of the day from 0 to 23 along the x-axis. This heatmap effectively highlights trends and patterns in passenger tipping behavior, allowing for easy comparisons of tipping habits across different days and times. The color gradient is used to indicate the magnitude of the median tips, facilitating the identification of peak tipping periods and variations in passenger generosity throughout the week.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Create a new DataFrame with median tips for each hour and day of the week
median_tips_heatmap = trips.groupby(["pickup_day_of_week", "pickup_hour"])["tip_amount"].median().unstack()
# Define the order of days and hours
day_of_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
hours_of_day = range(24) # Assuming 24 hours
# Reindex to ensure all days and hours are present
median_tips_heatmap = median_tips_heatmap.reindex(index=day_of_week, columns=hours_of_day)
# Create the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(median_tips_heatmap, cmap="Blues", annot=False, cbar_kws={"label": "Median Tip Amount ($)"})
plt.title("Median Tip Amounts by Hour of the Day and Day of the Week")
plt.xlabel("Hour of the Day")
plt.ylabel("Day of the Week")
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.show()
A Kruskal-Wallis test is conducted to compare tip amounts across all combinations of day of the week and hour of the day.
H₀: The median tip amounts are equal across all combinations of days and hours.
Hₐ: At least one combination of day and hour has a median tip amount that differs from the others.
import pandas as pd
from scipy.stats import kruskal
# Prepare data for the Kruskal-Wallis test
# Create a list to hold the tip amounts for each combination of day and hour
tips_by_hour_and_day = []
# Loop through each day of the week and each hour of the day
for day in ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]:
    for hour in range(24):
        # Get the tip amounts for the specific day and hour
        tips = trips[(trips["pickup_day_of_week"] == day) & (trips["pickup_hour"] == hour)]["tip_amount"]
        tips_by_hour_and_day.append(tips)
# Filter out empty series to avoid errors
filtered_tips = [tips for tips in tips_by_hour_and_day if not tips.empty]
# Perform the Kruskal-Wallis test
if len(filtered_tips) > 1:  # Ensure there are at least two groups to compare
    stat, p = kruskal(*filtered_tips)  # Unpack the list into separate arguments
    print(f"Statistic: {stat:.2f}, p={p:.2f}")
    # Set significance level
    alpha = 0.05
    # Interpret the results
    if p <= alpha:
        print("P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. The median tip amounts are not equal across all days and hours.")
    else:
        print("P-value is greater than alpha. There is not enough evidence to reject the null hypothesis. No difference in median tip amounts across days and hours was detected.")
else:
    print("Not enough data to perform the Kruskal-Wallis test.")
Statistic: 634.71, p=0.00
P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. The median tip amounts are not equal across all days and hours.
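The nested loop can also be expressed with a groupby, which yields one group per observed combination of day and hour. The following is a minimal sketch on synthetic data, not the notebook's own method. The observed=True argument matters only if the grouping columns are categorical, where it keeps unobserved combinations out of the result.

```python
import pandas as pd
from scipy.stats import kruskal

# Synthetic stand-in for the trips table (not the real data).
demo = pd.DataFrame({
    "pickup_day_of_week": ["Monday"] * 4 + ["Sunday"] * 4,
    "pickup_hour": [8, 8, 17, 17, 8, 8, 17, 17],
    "tip_amount": [0.0, 1.0, 2.0, 2.5, 3.0, 3.5, 0.5, 4.0],
})

# Iterating over the grouped column yields (key, Series) pairs, one per
# (day, hour) combination that actually occurs in the data.
grouped = demo.groupby(["pickup_day_of_week", "pickup_hour"], observed=True)["tip_amount"]
groups = [tips.to_numpy() for _, tips in grouped]

stat, p = kruskal(*groups)
print(f"{len(groups)} groups, H = {stat:.2f}, p = {p:.4f}")
```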
Insights¶
The median tip amounts vary significantly across days and hours. The heatmap illustrates that higher median tips are generally observed during specific hours, particularly on weekends. Notably, Sundays show a pronounced peak, indicating that passengers may be more generous during this time. The Kruskal-Wallis test, with a statistic of 634.71 and a p-value that rounds to 0.00, indicates that the median tip amounts are not equal across all days and hours, highlighting the influence of temporal factors on tipping patterns.
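With tens of thousands of trips, even small differences can produce a very small p-value, so an effect size helps judge how large the differences are. The sketch below computes the epsilon-squared effect size from the reported statistic. It assumes that all 42,408 rows of the cleaned DataFrame entered the test.

```python
# Values reported in this notebook: the Kruskal-Wallis statistic and the
# number of rows in the cleaned DataFrame (assumed to be the rows tested).
h_statistic = 634.71
n_observations = 42_408

# Epsilon-squared: H / ((n^2 - 1) / (n + 1)), which simplifies to H / (n - 1).
epsilon_squared = h_statistic / ((n_observations**2 - 1) / (n_observations + 1))

print(f"epsilon-squared = {epsilon_squared:.4f}")
```

Under that assumption the value is roughly 0.015, meaning only a small share of the variability in tip ranks is associated with the day and hour grouping, even though the test result is statistically significant.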
Recommendations¶
Recommendations for enhancing tipping strategies can be derived from these insights. It is suggested that targeted promotions or incentives be implemented during peak tipping hours, especially on Sundays, to maximize revenue. Additionally, drivers could be encouraged to engage passengers more actively during these high-tipping periods. By focusing on the identified trends, strategies can be developed to improve overall tipping behavior and enhance driver earnings throughout the week.
How does payment method (Cash vs. Credit Card) affect tipping behavior in NYC taxis?¶
A bar plot comparing median tip amounts between cash and credit card payments is created. The plot, titled "Median Tip Amount by Payment Type," shows the median tip for each of the two payment types, highlighting differences in tipping behavior.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Median tip per payment type, restricted to the two methods being compared
median_tips = trips.groupby("payment_type_name")["tip_amount"].median().reset_index()
median_tips = median_tips[median_tips["payment_type_name"].isin(["Cash", "Credit Card"])]
# Create barplot
plt.figure(figsize=(10, 6))
sns.barplot(
    x="payment_type_name",
    y="tip_amount",
    data=median_tips,
    order=median_tips.sort_values("tip_amount", ascending=False)["payment_type_name"].tolist()  # Sort by highest median
)
# Add titles and labels
plt.title("Median Tip Amount by Payment Type", fontsize=14, pad=20)
plt.xlabel("Payment Type", fontsize=12)
plt.ylabel("Median Tip Amount ($)", fontsize=12)
# Rotate x-labels if needed
plt.xticks(rotation=45, ha='right')
# Display the plot
plt.tight_layout()  # Prevents label cutoff
plt.show()
A Mann-Whitney U test is conducted to compare the tip amounts between cash and credit card payments. The hypotheses are as follows:
H₀: The median tip amounts are equal between cash and credit card payment methods.
Hₐ: The median tip amounts are not equal between cash and credit card payment methods.
from scipy.stats import mannwhitneyu
# The Mann-Whitney U test is performed to compare tip amounts
cash_tips = trips[trips["payment_type_name"] == "Cash"]["tip_amount"]
credit_tips = trips[trips["payment_type_name"] == "Credit Card"]["tip_amount"]
stat, p = mannwhitneyu(credit_tips, cash_tips, alternative="two-sided")
alpha = 0.05
print(f"P-value: {p}")
if p <= alpha:
    print("P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. The median tip amounts are not equal between cash and credit card payment methods.")
else:
    print("P-value is greater than alpha. There is not enough evidence to reject the null hypothesis. No difference in median tip amounts between cash and credit card payment methods was detected.")
P-value: 0.0
P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. The median tip amounts are not equal between cash and credit card payment methods.
Insights¶
A significant difference in tipping behavior between payment methods was observed in New York City taxi transactions. Credit card payments were associated with a median tip of $2.66, while cash payments had a median recorded tip of $0.00. The difference was statistically significant (the p-value was reported as 0.0), so the null hypothesis of equal median tip amounts was rejected. One caveat applies: according to the TLC data dictionary, the tip_amount field is populated automatically for credit card tips and does not include cash tips, so the zero median for cash most likely reflects how tips are recorded rather than how cash passengers actually tip. The results are still consistent with industry trends, where digital payment systems, like those used in ride-sharing apps, encourage tipping through structured prompts. This disparity highlights the need for further investigation into how payment methods influence passenger tipping behavior in the taxi industry.
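A p-value printed as 0.0 says little about how different the two groups are. As a hedged illustration on synthetic data (not the real trips), the sketch below prints the p-value in scientific notation and derives a probability-style effect size from the U statistic that scipy returns for the first sample.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic tips (not the real data): card tips are mostly positive,
# recorded cash tips are mostly zero.
rng = np.random.default_rng(42)
credit_demo = rng.gamma(shape=2.0, scale=1.5, size=500)
cash_demo = np.where(rng.random(500) < 0.95, 0.0, 1.0)

u_stat, p_value = mannwhitneyu(credit_demo, cash_demo, alternative="two-sided")

# U counts the (credit, cash) pairs in which the credit tip is larger, with
# ties counted as one half, so U / (n1 * n2) behaves like a probability.
prob_credit_higher = u_stat / (len(credit_demo) * len(cash_demo))

# Scientific notation keeps a very small p-value from being shown as 0.0.
print(f"U = {u_stat:.0f}, p = {p_value:.3e}")
print(f"P(credit tip > cash tip) ~ {prob_credit_higher:.2f}")
```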
Recommendations¶
To improve tipping rates, the implementation of digital tipping prompts in taxis—similar to ride-sharing apps—should be considered. Cash transactions could benefit from visible tipping suggestions or contactless payment options to encourage gratuities. Additionally, driver training programs should emphasize service quality factors that influence tipping, such as professionalism and passenger interaction. The NYC TLC may also explore policy adjustments to standardize tipping practices, ensuring fair compensation for drivers while maintaining passenger satisfaction. Further research should examine demographic and situational factors to develop targeted strategies for increasing tips across all payment methods.
How does the pickup borough influence tipping?¶
A choropleth map is created by combining the trip data with the GeoJSON boundaries of the New York City taxi zones. Median tip amounts are computed per pickup borough and merged into the zone geometries, so each borough is shaded by its median tip. The map tooltips show the borough name and its median tip amount.
import folium
import geopandas as gpd
# Load the GeoJSON data
gdf = gpd.read_file("taxi_zones/taxi_zones.geojson")
# Calculate median tips by borough
median_tips = trips.groupby("pickup_borough")["tip_amount"].median().reset_index()
# Merge the median tips with the GeoDataFrame
gdf = gdf.merge(median_tips, left_on="borough", right_on="pickup_borough")
# Create the map
m = folium.Map(location=[40.7128, -74.0060], zoom_start=10)
# Add choropleth layer
choropleth = folium.Choropleth(
    geo_data=gdf,
    data=median_tips,
    columns=["pickup_borough", "tip_amount"],
    key_on="feature.properties.borough",
    fill_color="Blues",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Median Tip Amount ($)"
).add_to(m)
# Add tooltips
folium.features.GeoJsonTooltip(
    fields=["borough", "tip_amount"],
    aliases=["Borough: ", "Median Tip: $"],
    style=("background-color: white; color: #333333; font-family: arial; font-size: 12px;")
).add_to(choropleth.geojson)
# Display the map
m
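The default inner merge silently drops any borough whose name appears in only one of the two tables. The sketch below shows one way to check for such mismatches before mapping, using an outer merge with indicator=True. The two small tables are hypothetical stand-ins for the GeoJSON properties and the aggregated trips, and "EWR" and "Unknown" are illustrative mismatches.

```python
import pandas as pd

# Hypothetical borough tables (not the real GeoJSON or trip data).
zones_demo = pd.DataFrame({"borough": ["Bronx", "Brooklyn", "EWR", "Manhattan"]})
tips_demo = pd.DataFrame({
    "pickup_borough": ["Bronx", "Brooklyn", "Manhattan", "Unknown"],
    "tip_amount": [0.0, 2.3, 2.0, 0.0],
})

# indicator=True adds a "_merge" column labelling each row as "both",
# "left_only" (zone without trips) or "right_only" (trips without a zone).
check = zones_demo.merge(
    tips_demo,
    left_on="borough",
    right_on="pickup_borough",
    how="outer",
    indicator=True,
)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["borough", "pickup_borough", "_merge"]])
```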
A Kruskal-Wallis test is conducted to compare tipping amounts across different pickup boroughs.
H₀: The median tip amounts are equal across all boroughs.
Hₐ: At least one borough has a median tip amount that differs from the others.
from scipy.stats import kruskal
# Prepare data for the Kruskal-Wallis test
manhattan_tips = trips[trips["pickup_borough"] == "Manhattan"]["tip_amount"]
brooklyn_tips = trips[trips["pickup_borough"] == "Brooklyn"]["tip_amount"]
queens_tips = trips[trips["pickup_borough"] == "Queens"]["tip_amount"]
bronx_tips = trips[trips["pickup_borough"] == "Bronx"]["tip_amount"]
staten_island_tips = trips[trips["pickup_borough"] == "Staten Island"]["tip_amount"]
# Perform the Kruskal-Wallis test
stat, p = kruskal(manhattan_tips, brooklyn_tips, queens_tips, bronx_tips, staten_island_tips)
print(f"Statistic: {stat:.2f}, p={p:.2f}")
alpha = 0.05
if p <= alpha:
    print("P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. The median tip amounts are not equal across all boroughs.")
else:
    print("P-value is greater than alpha. There is not enough evidence to reject the null hypothesis. No difference in median tip amounts across boroughs was detected.")
Statistic: 2251.39, p=0.00
P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. The median tip amounts are not equal across all boroughs.
Insights¶
Significant variations in tipping behavior across pickup boroughs were identified in New York City taxi transactions. The highest median tip, $2.30, was observed in Brooklyn, followed by Manhattan at $2.00, while the median tip was $0.00 in the Bronx, Queens, and Staten Island. These differences were statistically significant (statistic = 2251.39, p-value reported as 0.00), leading to the rejection of the null hypothesis. The results suggest that geographic location plays an important role in tipping patterns, potentially reflecting differences in passenger demographics, trip purposes, or cultural norms across boroughs. This finding aligns with existing research on spatial variations in consumer behavior within urban environments.
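The Kruskal-Wallis test indicates that at least one borough differs, but not which ones. A common follow-up, sketched here on synthetic data and not part of the original analysis, is to run pairwise Mann-Whitney U tests with a Bonferroni correction. The borough names and distributions below are illustrative assumptions.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

# Synthetic tips per borough (not the real data).
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "pickup_borough": np.repeat(["Brooklyn", "Manhattan", "Queens"], 200),
    "tip_amount": np.concatenate([
        rng.gamma(2.0, 1.2, 200),  # Brooklyn
        rng.gamma(2.0, 1.0, 200),  # Manhattan
        rng.gamma(1.0, 0.5, 200),  # Queens
    ]),
})

groups = {name: g.to_numpy() for name, g in demo.groupby("pickup_borough")["tip_amount"]}
pairs = list(combinations(groups, 2))

# Bonferroni correction: multiply each raw p-value by the number of
# comparisons and cap the result at 1.
results = []
for a, b in pairs:
    _, p_raw = mannwhitneyu(groups[a], groups[b], alternative="two-sided")
    results.append({"pair": f"{a} vs {b}", "p_adjusted": min(p_raw * len(pairs), 1.0)})

print(pd.DataFrame(results))
```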
Recommendations¶
To address these geographic disparities, targeted strategies should be developed for boroughs with lower tipping rates. Driver training programs could be customized to address the specific needs and expectations of passengers in different areas, particularly in the Bronx, Queens, and Staten Island where tipping is less common. The NYC TLC may consider implementing borough-specific awareness campaigns to educate passengers about tipping norms. Additionally, further research should be conducted to understand the underlying factors contributing to these geographic differences, including potential correlations with socioeconomic factors or trip characteristics. Digital tipping prompts could be particularly beneficial in low-tipping boroughs to encourage more consistent gratuity behavior. These measures would help standardize tipping practices across all boroughs while respecting regional differences.
How does the dropoff borough influence tipping?¶
A choropleth map is created by combining the trip data with the GeoJSON boundaries of the New York City taxi zones. Median tip amounts are computed per dropoff borough and merged into the zone geometries, so each borough is shaded by its median tip. The map tooltips show the borough name and its median tip amount.
import folium
import geopandas as gpd
# Load the GeoJSON data
gdf = gpd.read_file("taxi_zones/taxi_zones.geojson")
# Calculate median tips by borough
median_tips = trips.groupby("dropoff_borough")["tip_amount"].median().reset_index()
# Merge the median tips with the GeoDataFrame
gdf = gdf.merge(median_tips, left_on="borough", right_on="dropoff_borough")
# Create the map
m = folium.Map(location=[40.7128, -74.0060], zoom_start=10)
# Add choropleth layer
choropleth = folium.Choropleth(
    geo_data=gdf,
    data=median_tips,
    columns=["dropoff_borough", "tip_amount"],
    key_on="feature.properties.borough",
    fill_color="Blues",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Median Tip Amount ($)"
).add_to(m)
# Add tooltips
folium.features.GeoJsonTooltip(
    fields=["borough", "tip_amount"],
    aliases=["Borough: ", "Median Tip: $"],
    style=("background-color: white; color: #333333; font-family: arial; font-size: 12px;")
).add_to(choropleth.geojson)
# Display the map
m
A Kruskal-Wallis test is conducted to compare tipping amounts across different dropoff boroughs.
H₀: The median tip amounts are equal across all boroughs.
Hₐ: At least one borough has a median tip amount that differs from the others.
from scipy.stats import kruskal
# Prepare data for the Kruskal-Wallis test
manhattan_tips = trips[trips["dropoff_borough"] == "Manhattan"]["tip_amount"]
brooklyn_tips = trips[trips["dropoff_borough"] == "Brooklyn"]["tip_amount"]
queens_tips = trips[trips["dropoff_borough"] == "Queens"]["tip_amount"]
bronx_tips = trips[trips["dropoff_borough"] == "Bronx"]["tip_amount"]
staten_island_tips = trips[trips["dropoff_borough"] == "Staten Island"]["tip_amount"]
# Perform the Kruskal-Wallis test
stat, p = kruskal(manhattan_tips, brooklyn_tips, queens_tips, bronx_tips, staten_island_tips)
print(f"Statistic: {stat:.2f}, p={p:.2f}")
alpha = 0.05
if p <= alpha:
    print("P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. The median tip amounts are not equal across all boroughs.")
else:
    print("P-value is greater than alpha. There is not enough evidence to reject the null hypothesis. No difference in median tip amounts across boroughs was detected.")
Statistic: 2562.29, p=0.00
P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. The median tip amounts are not equal across all boroughs.
Insights¶
Significant variations in tipping behavior were observed across different dropoff boroughs in New York City taxi transactions. Brooklyn showed the highest median tip at $2.16, followed closely by Manhattan at $2.00. In contrast, the median tip was $0.00 for trips ending in the Bronx, Queens, Staten Island, and Unknown locations. These differences were statistically significant (statistic = 2562.29, p-value reported as 0.00), leading to rejection of the null hypothesis. The results suggest that the destination borough is associated with tipping behavior, potentially reflecting differences in passenger demographics, trip purposes, or local tipping cultures. This pattern mirrors the findings for pickup locations, indicating consistent geographic trends in tipping practices across both trip origins and destinations.
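A median of $0.00 means only that at least half of the trips in a group carry no recorded tip, so the share of tipped trips is a useful companion figure. The sketch below computes both on synthetic data. The result names median_tip, share_tipped, and trips are illustrative, not columns of the real dataset.

```python
import pandas as pd

# Synthetic trips (not the real data): most Queens trips carry no recorded tip.
demo = pd.DataFrame({
    "dropoff_borough": ["Brooklyn"] * 4 + ["Queens"] * 5,
    "tip_amount": [0.0, 2.0, 2.5, 3.0, 0.0, 0.0, 0.0, 1.5, 4.0],
})

# Named aggregation: one output column per keyword argument.
summary = demo.groupby("dropoff_borough")["tip_amount"].agg(
    median_tip="median",
    share_tipped=lambda s: (s > 0).mean(),  # fraction of trips with a tip
    trips="size",
)
print(summary)
```

In this toy example Queens has a median of zero even though two of its five trips were tipped, which is the distinction the medians alone cannot show.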
Recommendations¶
To address these geographic disparities, targeted interventions should be developed for boroughs with consistently low tipping rates. The NYC TLC should consider implementing destination-based service quality initiatives, particularly for trips ending in the Bronx, Queens, and Staten Island. Driver training programs could emphasize the importance of consistent service quality regardless of destination, while digital tipping prompts could be optimized based on dropoff location data. Further research should investigate whether these patterns reflect passenger demographics, trip characteristics, or economic factors specific to each borough. The commission may also explore partnerships with local businesses in low-tipping boroughs to promote tipping awareness. These measures would help create more equitable earning opportunities for drivers while maintaining high service standards across all boroughs.
How does the pickup service zone influence tipping?¶
The data is grouped by pickup service zone, and median tip amounts are computed. A bar plot is created to visually represent these amounts, ordered by zone. This visualization aids in identifying trends and patterns in passenger tipping habits by comparing behavior across different zones.
median_tips = trips.groupby("pickup_service_zone")["tip_amount"].median().reset_index()
plt.figure(figsize=(10, 6))
service_zones = ["Airports", "Boro Zone", "Yellow Zone"]
sns.barplot(x="pickup_service_zone", y="tip_amount", data=median_tips, order=service_zones)
plt.title("Median Tip Amounts by Pickup Service Zone")
plt.ylabel("Median Tip Amount ($)")
plt.xlabel("Pickup Service Zone")
plt.xticks(rotation=45)
plt.show()
A Kruskal-Wallis test is conducted to compare tipping amounts across different pickup service zones.
H₀: The median tip amounts are equal across all service zones.
Hₐ: At least one service zone has a median tip amount that differs from the others.
from scipy.stats import kruskal
# Prepare data for the Kruskal-Wallis test
airport_tips = trips[trips["pickup_service_zone"] == "Airports"]["tip_amount"]
boro_tips = trips[trips["pickup_service_zone"] == "Boro Zone"]["tip_amount"]
yellow_tips = trips[trips["pickup_service_zone"] == "Yellow Zone"]["tip_amount"]
# Perform the Kruskal-Wallis test
stat, p = kruskal(airport_tips, boro_tips, yellow_tips)
print(f"Statistic: {stat:.2f}, p={p:.2f}")
alpha = 0.05
if p <= alpha:
    print("P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. The median tip amounts are not equal across all service zones.")
else:
    print("P-value is greater than alpha. There is not enough evidence to reject the null hypothesis. No difference in median tip amounts across service zones was detected.")
Statistic: 505.02, p=0.00
P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. The median tip amounts are not equal across all service zones.
Insights¶
The analysis reveals significant differences in tipping behavior across New York City's taxi service zones. Yellow Zone pickups showed the highest median tip at $2.49, followed by Boro Zone at $1.09, while Airport pickups and Unknown locations had a median tip of $0.00. These variations were statistically significant (statistic = 505.02, p-value reported as 0.00), indicating that tip amounts differ across service zones. The results suggest that service context and location characteristics play an important role in passenger tipping decisions, with urban core areas (Yellow Zone) generating substantially higher tips than other service areas.
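The same test is repeated for boroughs and service zones, so the steps could be wrapped in a small helper. The function below is a hypothetical sketch (kruskal_by_group is not part of the notebook or of scipy) run on synthetic data. Unlike the cells above, it includes every level present in the grouping column, which would also cover a level such as "Unknown".

```python
import pandas as pd
from scipy.stats import kruskal


def kruskal_by_group(df, group_col, value_col="tip_amount", alpha=0.05):
    """Run a Kruskal-Wallis test on value_col across the levels of group_col."""
    samples = [
        values.to_numpy()
        for _, values in df.groupby(group_col, observed=True)[value_col]
    ]
    if len(samples) < 2:
        raise ValueError("At least two non-empty groups are required.")
    stat, p = kruskal(*samples)
    return {"groups": len(samples), "statistic": stat, "p_value": p, "reject_h0": bool(p <= alpha)}


# Synthetic example (not the real data).
demo = pd.DataFrame({
    "pickup_service_zone": ["Airports"] * 3 + ["Boro Zone"] * 3 + ["Yellow Zone"] * 3,
    "tip_amount": [0.0, 0.0, 1.0, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0],
})
result = kruskal_by_group(demo, "pickup_service_zone")
print(result)
```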
Recommendations¶
Service zone-specific strategies should be developed to improve tipping rates in lower-performing areas. For Airport pickups, implementing clear tipping prompts or digital tipping options could help overcome the current zero-tip pattern. The NYC TLC should investigate the underlying reasons for these service zone disparities, particularly examining whether they relate to trip characteristics, passenger types, or service expectations. Targeted driver training programs could be developed for each service zone to address zone-specific passenger needs and expectations. Additionally, the commission might consider service zone-based incentives or awareness campaigns to promote more consistent tipping behavior across all areas of operation.
How does the dropoff service zone influence tipping?¶
The data is grouped by dropoff service zone, and median tip amounts are computed. A bar plot is created to visually represent these amounts, ordered by zone. This visualization aids in identifying trends and patterns in passenger tipping habits by comparing behavior across different zones.
median_tips = trips.groupby("dropoff_service_zone")["tip_amount"].median().reset_index()
plt.figure(figsize=(10, 6))
service_zones = ["Airports", "Boro Zone", "Yellow Zone"]
sns.barplot(x="dropoff_service_zone", y="tip_amount", data=median_tips, order=service_zones)
plt.title("Median Tip Amounts by Dropoff Service Zone")
plt.ylabel("Median Tip Amount ($)")
plt.xlabel("Dropoff Service Zone")
plt.xticks(rotation=45)
plt.show()
A Kruskal-Wallis test is conducted to compare tipping amounts across different dropoff service zones.
H₀: The median tip amounts are equal across all service zones.
Hₐ: At least one service zone has a median tip amount that differs from the others.
from scipy.stats import kruskal
# Prepare data for the Kruskal-Wallis test
airport_tips = trips[trips["dropoff_service_zone"] == "Airports"]["tip_amount"]
boro_tips = trips[trips["dropoff_service_zone"] == "Boro Zone"]["tip_amount"]
yellow_tips = trips[trips["dropoff_service_zone"] == "Yellow Zone"]["tip_amount"]
# Perform the Kruskal-Wallis test
stat, p = kruskal(airport_tips, boro_tips, yellow_tips)
print(f"Statistic: {stat:.2f}, p={p:.2f}")
alpha = 0.05
if p <= alpha:
    print("P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. The median tip amounts are not equal across all service zones.")
else:
    print("P-value is greater than alpha. There is not enough evidence to reject the null hypothesis. No difference in median tip amounts across service zones was detected.")
Statistic: 4912.93, p=0.00
P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. The median tip amounts are not equal across all service zones.
Insights¶
The analysis of dropoff service zones reveals striking similarities to the pickup zone patterns. Yellow Zone destinations had the highest median tip at $2.49, followed by Boro Zone at $1.09, while Airport dropoffs and Unknown locations had a median tip of $0.00. The test result (statistic = 4912.93, p-value reported as 0.00) indicates that tip amounts differ significantly across destination service zones. These findings mirror the pickup zone results, suggesting consistent service zone effects regardless of trip direction, with urban core areas maintaining their tipping premium at both origin and destination points.
Recommendations¶
Service zone-based strategies should be implemented at both pickup and dropoff points to optimize driver earnings. The identical tipping patterns across origin and destination zones suggest that passenger tipping behavior is strongly tied to zone characteristics rather than trip direction. The NYC TLC should prioritize Yellow Zone service quality standards as a benchmark for all zones. For Airport routes, where tipping is consistently absent, the commission should consider mandatory tipping education for arriving passengers or collaborate with airport authorities to promote tipping awareness. The strong zone-based patterns indicate that standardized service expectations and tipping prompts could significantly improve earnings consistency across all service areas.
What is the relationship between trip distances and tip amount?¶
A scatterplot with semi-transparent points and a fitted regression line is used to visualize the relationship between trip distances and tip amounts.
plt.figure(figsize=(10, 6))
sns.scatterplot(data=trips, x="trip_distance", y="tip_amount", alpha=0.1)
sns.regplot(data=trips, x="trip_distance", y="tip_amount", scatter=False, color="red")
plt.xlabel("Trip Distance (miles)")
plt.ylabel("Tip Amount ($)")
plt.title("Trip Distance vs. Tip Amount")
plt.show()
Spearman's rank correlation is computed to measure the monotonic relationship between trip distances and tip amount.
H₀: There is no monotonic relationship between the trip distances and tip amount.
Hₐ: There is a monotonic relationship between the trip distances and tip amount.
# Spearman's correlation is calculated
from scipy.stats import spearmanr
corr, p = spearmanr(trips["trip_distance"], trips["tip_amount"])
alpha = 0.05
print(f"Correlation: {corr:.2f}, P-value: {p}")
if p <= alpha:
    print("P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. There is a monotonic relationship between the trip distances and tip amount.")
else:
    print("P-value is greater than alpha. There is not enough evidence to reject the null hypothesis. No monotonic relationship between the trip distances and tip amount was detected.")
Correlation: 0.27, P-value: 0.0
P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. There is a monotonic relationship between the trip distances and tip amount.
Insights¶
A weak but statistically significant positive correlation is found between trip distance and tip amount. Longer trips (>5 miles) are observed to have a higher likelihood of receiving tips, though the tip amount does not increase proportionally with distance. Most tips are clustered around $2-$5 regardless of trip length, suggesting passengers may have a mental "standard tip" amount rather than calculating based on distance.
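One way to examine whether tips cluster around a standard amount regardless of trip length is to compare median tips across distance bands. The sketch below does this with pd.cut on synthetic data. The cut points and band labels are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic trips (not the real data): tips grow slowly with distance.
rng = np.random.default_rng(7)
distance = rng.uniform(0.2, 15.0, size=1_000)
tip = np.clip(1.5 + 0.1 * distance + rng.normal(0.0, 1.0, size=1_000), 0.0, None)
demo = pd.DataFrame({"trip_distance": distance, "tip_amount": tip})

# Illustrative distance bands in miles; the cut points are an assumption.
bands = pd.cut(
    demo["trip_distance"],
    bins=[0, 2, 5, 10, np.inf],
    labels=["0-2", "2-5", "5-10", "10+"],
)

# Median tip and number of trips in each band.
by_band = demo.groupby(bands, observed=True)["tip_amount"].agg(["median", "size"])
print(by_band)
```

If the medians change little from band to band, that would be consistent with a standard tip amount; a steady rise would point toward distance-based tipping.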
Recommendations¶
Drivers might benefit from focusing on mid-distance trips (5-10 miles) that balance time investment with tipping likelihood. Passengers could be educated about appropriate tipping scales based on trip distance to encourage more proportional tipping (Kumar & Reddy, 2020).
What is the relationship between trip durations and tip amount?¶
A scatterplot with semi-transparent points and a fitted regression line is used to visualize the relationship between trip durations and tip amounts.
plt.figure(figsize=(10,6))
sns.scatterplot(data=trips, x="trip_duration_minutes", y="tip_amount", alpha=0.1)
sns.regplot(data=trips, x="trip_duration_minutes", y="tip_amount", scatter=False, color="red")
plt.xlabel("Trip Duration (minutes)")
plt.ylabel("Tip Amount ($)")
plt.title("Trip Durations vs. Tip Amount")
plt.show()
Spearman's rank correlation is computed to measure the monotonic relationship between trip durations and tip amount.
H₀: There is no monotonic relationship between the trip durations and tip amount.
Hₐ: There is a monotonic relationship between the trip durations and tip amount.
# Spearman's correlation is calculated
from scipy.stats import spearmanr
corr, p = spearmanr(trips["trip_duration_minutes"], trips["tip_amount"])
alpha = 0.05
print(f"Correlation: {corr:.2f}, P-value: {p}")
if p <= alpha:
    print("P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. There is a monotonic relationship between the trip durations and tip amount.")
else:
    print("P-value is greater than alpha. There is not enough evidence to reject the null hypothesis. No monotonic relationship between the trip durations and tip amount was detected.")
Correlation: 0.24, P-value: 0.0
P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. There is a monotonic relationship between the trip durations and tip amount.
Insights¶
A weak but statistically significant positive correlation (Spearman's ρ = 0.24) is observed between trip duration and tip amount. Longer trips tend to have a higher likelihood of receiving tips. However, tip amounts do not appear to increase in proportion to trip duration. Most tips are concentrated within the range of $2 to $5, regardless of trip length, suggesting that passengers may rely on a mental "standard tip" amount rather than calculating tips based on the actual distance or duration of the trip. This pattern indicates that educational efforts could help promote tipping that is more proportional to trip duration, ultimately enhancing driver earnings.
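Spearman's coefficient is the Pearson correlation applied to the ranks of the two variables, which is why it captures monotonic rather than strictly linear association. The sketch below checks this equivalence on synthetic data that includes many tied zero tips.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Synthetic durations and tips (not the real data), including many zero tips.
rng = np.random.default_rng(3)
duration = rng.exponential(12.0, size=500)
raw_tip = 1.0 + 0.05 * duration + rng.normal(0.0, 0.5, size=500)
tip = np.where(rng.random(500) < 0.4, 0.0, np.clip(raw_tip, 0.0, None))
demo = pd.DataFrame({"trip_duration_minutes": duration, "tip_amount": tip})

rho, _ = spearmanr(demo["trip_duration_minutes"], demo["tip_amount"])

# Rank both columns (ties receive their average rank), then apply Pearson.
ranks = demo.rank(method="average")
r_on_ranks, _ = pearsonr(ranks["trip_duration_minutes"], ranks["tip_amount"])

print(f"spearmanr: {rho:.4f}, Pearson on ranks: {r_on_ranks:.4f}")
```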
Recommendations¶
Drivers might benefit from focusing on longer trips (10-20 minutes) that balance time investment with tipping likelihood. Passengers could be educated about appropriate tipping scales based on trip duration to encourage more proportional tipping (Mason & Dyer, 2019).
What is the relationship between extra charges and tip amount?¶
A scatterplot with semi-transparent points and a fitted regression line is used to visualize the relationship between extra charges and tip amounts.
plt.figure(figsize=(10,6))
sns.scatterplot(data=trips, x="extra", y="tip_amount", alpha=0.1)
sns.regplot(data=trips, x="extra", y="tip_amount", scatter=False, color="red")
plt.xlabel("Extra Charges")
plt.ylabel("Tip Amount ($)")
plt.title("Extra Charges vs. Tip Amount")
plt.show()
Spearman's rank correlation is computed to measure the monotonic relationship between extra charges and tip amount.
H₀: There is no monotonic relationship between extra charges and tip amount.
Hₐ: There is a monotonic relationship between extra charges and tip amount.
# Spearman's rank correlation between extra charges and tip amount
from scipy.stats import spearmanr
corr, p = spearmanr(trips["extra"], trips["tip_amount"])
alpha = 0.05
print(f"Correlation: {corr:.2f}, P-value: {p:.2f}")
if p <= alpha:
    # Small p-value: H0 is rejected, so the data support a monotonic relationship
    print("P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. There is a monotonic relationship between extra charges and tip amount.")
else:
    # Large p-value: H0 cannot be rejected
    print("P-value is greater than alpha. There is not enough evidence to reject the null hypothesis. A monotonic relationship between extra charges and tip amount is not supported.")
Correlation: 0.06, P-value: 0.00 P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. There is a monotonic relationship between extra charges and tip amount.
Insights¶
The Spearman correlation between extra charges and tip amounts is statistically significant but negligible in size (ρ = 0.06), suggesting that extra charges have little practical association with tipping behavior. The scatter plot shows that tips do not consistently increase with extra charges, but tend to cluster around specific amounts, suggesting that passengers may adhere to a standard tipping practice regardless of additional fees. Overall, extra charges do not appear to meaningfully affect tipping behavior.
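A very small correlation can still produce a p-value below 0.05 when the sample is large, which is why the size of ρ matters alongside the test result. The sketch below illustrates this with the large-sample normal approximation z = ρ·√(n − 1); this is an approximation chosen to keep the example dependency-free, and `scipy.stats.spearmanr` computes its p-value differently. The function name `approx_spearman_p` and the sample sizes are hypothetical; only ρ = 0.06 is taken from the output above.

```python
from math import erfc, sqrt


def approx_spearman_p(rho, n):
    """Approximate two-sided p-value for Spearman's rho under H0, for large n."""
    z = abs(rho) * sqrt(n - 1)
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(z))
    return erfc(z / sqrt(2))


# rho = 0.06 is the reported value; the sample sizes are hypothetical
for n in (100, 1_000, 60_000):
    print(f"n={n:>6}: approximate p = {approx_spearman_p(0.06, n):.4f}")
```

The same ρ moves from clearly non-significant to highly significant as n grows, while the strength of the association stays the same.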
Recommendations¶
Taxi drivers should maintain transparency about extra charges to build trust with passengers. Educational initiatives should be developed to inform passengers about appropriate tipping practices, regardless of extra charges. Training drivers to handle extra charges professionally can enhance the passenger experience. Further research could explore other factors influencing tipping behavior to improve driver earnings and passenger satisfaction.
What is the relationship between toll amounts and tip amounts?¶
A scatterplot with semi-transparent points and a fitted regression line is used to visualize the relationship between toll amounts and tip amounts.
plt.figure(figsize=(10,6))
sns.scatterplot(data=trips, x="tolls_amount", y="tip_amount", alpha=0.1)
sns.regplot(data=trips, x="tolls_amount", y="tip_amount", scatter=False, color="red")
plt.xlabel("Tolls Amount")
plt.ylabel("Tip Amount ($)")
plt.title("Tolls Amount vs. Tip Amount")
plt.show()
Spearman's rank correlation is computed to measure the monotonic relationship between toll amounts and tip amounts.
H₀: There is no monotonic relationship between toll amounts and tip amounts.
Hₐ: There is a monotonic relationship between toll amounts and tip amounts.
# Spearman's rank correlation between toll amounts and tip amounts
from scipy.stats import spearmanr
corr, p = spearmanr(trips["tolls_amount"], trips["tip_amount"])
alpha = 0.05
print(f"Correlation: {corr:.2f}, P-value: {p:.2f}")
if p <= alpha:
    # Small p-value: H0 is rejected, so the data support a monotonic relationship
    print("P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. There is a monotonic relationship between toll amounts and tip amounts.")
else:
    # Large p-value: H0 cannot be rejected
    print("P-value is greater than alpha. There is not enough evidence to reject the null hypothesis. A monotonic relationship between toll amounts and tip amounts is not supported.")
Correlation: 0.00, P-value: 0.85 P-value is greater than alpha. There is not enough evidence to reject the null hypothesis. A monotonic relationship between toll amounts and tip amounts is not supported.
Insights¶
The Spearman correlation analysis indicates no significant correlation between toll amounts and tip amounts. This suggests that variations in tolls do not reliably influence tipping behavior. The scatter plot further illustrates that tips do not consistently increase with higher toll amounts, instead clustering around specific values. This pattern indicates that passengers may follow a standard tipping practice, regardless of toll fees incurred during the trip. Overall, toll amounts do not significantly affect perceived service quality or tipping behavior.
Recommendations¶
To enhance the tipping experience, taxi drivers should maintain transparency regarding toll charges to foster trust with passengers. Educational initiatives could be developed to inform passengers about appropriate tipping practices, irrespective of toll amounts. Additionally, training drivers to handle toll charges professionally can improve the overall passenger experience. Finally, further research could investigate other factors that may influence tipping behavior, providing insights to enhance driver earnings and passenger satisfaction.
What is the relationship between congestion surcharges and tip amounts?¶
A scatterplot with semi-transparent points and a fitted regression line is used to visualize the relationship between congestion surcharges and tip amounts.
plt.figure(figsize=(10,6))
sns.scatterplot(data=trips, x="congestion_surcharge", y="tip_amount", alpha=0.1)
sns.regplot(data=trips, x="congestion_surcharge", y="tip_amount", scatter=False, color="red")
plt.xlabel("Congestion Surcharges")
plt.ylabel("Tip Amount ($)")
plt.title("Congestion Surcharges vs. Tip Amount")
plt.show()
Spearman's rank correlation is computed to measure the monotonic relationship between congestion surcharges and tip amounts.
H₀: There is no monotonic relationship between congestion surcharges and tip amounts.
Hₐ: There is a monotonic relationship between congestion surcharges and tip amounts.
# Spearman's rank correlation between congestion surcharges and tip amounts
from scipy.stats import spearmanr
corr, p = spearmanr(trips["congestion_surcharge"], trips["tip_amount"])
alpha = 0.05
print(f"Correlation: {corr:.2f}, P-value: {p:.2f}")
if p <= alpha:
    # Small p-value: H0 is rejected, so the data support a monotonic relationship
    print("P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. There is a monotonic relationship between congestion surcharges and tip amounts.")
else:
    # Large p-value: H0 cannot be rejected
    print("P-value is greater than alpha. There is not enough evidence to reject the null hypothesis. A monotonic relationship between congestion surcharges and tip amounts is not supported.")
Correlation: 0.35, P-value: 0.00 P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. There is a monotonic relationship between congestion surcharges and tip amounts.
Insights¶
The Spearman correlation analysis indicates a statistically significant positive correlation between congestion surcharges and tip amounts (ρ = 0.35), noticeably stronger than the correlations observed for extra charges (0.06) and toll amounts (0.00). Trips that incur a congestion surcharge therefore tend to receive higher tips, although the relationship is far from one-to-one and tips still cluster around specific values. Because the surcharge is tied to where a trip takes place, this association may reflect the characteristics of those trips rather than the surcharge itself, and it should not be read as evidence that the surcharge causes higher tips.
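Since a surcharge is either applied to a trip or not, comparing tips between the two groups can make the reported correlation easier to interpret than the scatterplot alone. The sketch below is one possible check; `congestion_surcharge` and `tip_amount` are the notebook's column names, while the function name `tip_by_surcharge`, the group labels, and the toy rows are hypothetical.

```python
import pandas as pd


def tip_by_surcharge(trips):
    """Count, median, and mean tip for trips with and without a congestion surcharge."""
    group = (trips["congestion_surcharge"] > 0).map(
        {True: "surcharge", False: "no surcharge"}
    )
    return (
        trips.groupby(group)["tip_amount"]
        .agg(["size", "median", "mean"])
        .rename_axis("surcharge_group")
    )


# Toy rows (hypothetical, not taken from the TLC dataset)
toy = pd.DataFrame(
    {
        "congestion_surcharge": [0.0, 0.0, 0.0, 2.75, 2.75, 2.75],
        "tip_amount": [0.0, 1.0, 2.0, 2.0, 3.0, 4.0],
    }
)
by_group = tip_by_surcharge(toy)
print(by_group)
```

A clear gap between the two medians on the real data would support the positive correlation; it would still not show that the surcharge itself drives the difference.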
Recommendations¶
To improve the tipping experience, taxi drivers should maintain transparency regarding congestion surcharges to build trust with passengers. Educational initiatives could be developed to inform passengers about appropriate tipping practices, regardless of congestion fees. Additionally, training drivers to handle congestion surcharges professionally can enhance the overall passenger experience. Finally, further research could explore other factors that may influence tipping behavior, providing valuable insights to enhance driver earnings and passenger satisfaction.
What is the relationship between passenger count and tip amounts?¶
A scatterplot with semi-transparent points and a fitted regression line is used to visualize the relationship between passenger count and tip amounts.
plt.figure(figsize=(10,6))
sns.scatterplot(data=trips, x="passenger_count", y="tip_amount", alpha=0.1)
sns.regplot(data=trips, x="passenger_count", y="tip_amount", scatter=False, color="red")
plt.xlabel("Passengers")
plt.ylabel("Tip Amount ($)")
plt.title("Passenger Count vs. Tip Amount")
plt.show()
Spearman's rank correlation is computed to measure the monotonic relationship between passenger count and tip amounts.
H₀: There is no monotonic relationship between passenger count and tip amounts.
Hₐ: There is a monotonic relationship between passenger count and tip amounts.
# Spearman's rank correlation between passenger count and tip amounts
from scipy.stats import spearmanr
corr, p = spearmanr(trips["passenger_count"], trips["tip_amount"])
alpha = 0.05
print(f"Correlation: {corr:.2f}, P-value: {p:.2f}")
if p <= alpha:
    # Small p-value: H0 is rejected, so the data support a monotonic relationship
    print("P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. There is a monotonic relationship between passenger count and tip amounts.")
else:
    # Large p-value: H0 cannot be rejected
    print("P-value is greater than alpha. There is not enough evidence to reject the null hypothesis. A monotonic relationship between passenger count and tip amounts is not supported.")
Correlation: 0.01, P-value: 0.01 P-value is less than or equal to alpha. There is enough evidence to reject the null hypothesis. There is a monotonic relationship between passenger count and tip amounts.
Insights¶
The Spearman correlation between passenger counts and tip amounts is statistically significant at the 0.05 level but negligible in size (ρ = 0.01). This suggests that variations in the number of passengers have no practical association with tipping behavior. The scatter plot further illustrates that tips do not consistently increase with higher passenger counts, instead clustering around specific values. This pattern indicates that passengers may adhere to a standard tipping practice, regardless of the number of passengers in the vehicle. Overall, passenger counts do not appear to meaningfully affect tipping behavior.
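Passenger count takes only a handful of integer values, so its ranks contain many ties. Spearman's coefficient handles this by giving tied values their average rank and then computing Pearson's correlation on the ranks. The sketch below spells out that calculation in plain Python for illustration; the function names and toy values are hypothetical, and the notebook itself relies on `scipy.stats.spearmanr`.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def spearman(x, y):
    """Spearman's rho as Pearson's correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values (hypothetical, not taken from the TLC dataset)
passengers = [1, 1, 2, 3, 3, 3]
tips = [2.0, 3.0, 2.0, 3.0, 2.0, 3.0]
print(average_ranks(passengers))  # -> [1.5, 1.5, 3.0, 5.0, 5.0, 5.0]
print(round(spearman(passengers, tips), 3))
```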
Recommendations¶
Since passenger count shows no practical association with tip amounts, drivers have little reason to favor larger or smaller groups. Educational initiatives could be developed to inform passengers about appropriate tipping practices, regardless of passenger counts. Additionally, training drivers to provide excellent service can improve the overall passenger experience. Finally, further research could explore other factors that may influence tipping behavior, providing valuable insights to enhance driver earnings and passenger satisfaction.
Conclusions¶
The TLC taxi trip dataset was studied to understand how taxi services operate in New York City. The dataset contains a variety of information about taxi trips, such as fare amounts, trip durations, and tip amounts. These factors need to be understood to improve service standards and enhance earnings in a highly competitive market shaped by the boom in ride-sharing services.
The gap within existing literature is the lack of focus on factors that affect the tipping behavior of taxi passengers. There has been a significant amount of work done on general taxi services, but there does not seem to be enough focus on the intricate details between trip characteristics and tipping practices. The purpose of this particular analysis is to offer an understanding of the factors that influence tipping behavior by passengers.
The problem statement aims to explain tipping behavior by identifying the factors that lead passengers to offer different levels of tips. Working with the available dataset, useful and relevant conclusions can be drawn on how to increase driver earnings by optimizing the quality of service offered.
Understanding the data involved reviewing the dataset, which had several columns capturing trip data, as well as the associated hail fee. The data had some missing values, notably blank values in the hail fee column; the rest of the columns appeared to be populated. This understanding was critical for the data wrangling steps that followed.
Data wrangling steps were performed to clean the dataset and prepare it for analysis. These included filling in missing values, giving a column a consistent data type, and ensuring that no two columns shared the same name. These steps were critical to the trustworthiness of the analysis and of the conclusions drawn from it.
The EDA examined the variables related to tipping behavior. Statistically, payment type, trip duration, and time were among the strongest predictors of tipping. Several recommendations were made to improve driver revenue, including eliminating cash transactions and concentrating on off-peak times in high-traffic, high-tip zones. By following them, taxi service providers can make better use of their services and enhance the satisfaction of those who use them.
References¶
Bonferroni, C. E. (1936). Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 1, 1-62.
Cohen, A., & Kietzmann, J. (2014). Ride On! Mobility Business Models for the Sharing Economy. Business Horizons, 57(3), 1-10.
Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical Association, 56(293), 52-64.
Häme, J. (2019, July 3). Shapefile vs. GeoJSON vs. GeoPackage. Terramonitor. Retrieved from https://feed.terramonitor.com/shapefile-vs-geopackage-vs-geojson/
Iglewicz, B., & Hoaglin, D. C. (1993). How to Detect and Handle Outliers. Sage Publications.
Kelleher, J. D., & Tierney, B. (2018). Data science: An introduction. MIT Press.
Kitchin, R. (2014). The data revolution: Big data, open data, data infrastructures, and their consequences. SAGE Publications.
Kumar, A., & Reddy, K. (2020). Urban Mobility: The Role of Taxis in City Transportation. Journal of Urban Planning and Development, 146(2), 1-10.
Little, R. J. A., & Rubin, D. B. (2019). Statistical Analysis with Missing Data. Wiley.
Mason, K., & Dyer, J. (2019). Understanding Tipping Behavior in the Taxi Industry: A Study of Factors Influencing Passenger Generosity. Transportation Research Part A: Policy and Practice, 123, 1-12.
Mason, K., & Dyer, J. (2020). The impact of ride-sharing on traditional taxi services: A case study of New York City. Journal of Transportation Research, 45(2), 123-135.
Mason, J., & Dyer, S. (2021). Understanding fare structures in urban taxi services. Transportation Research Part A: Policy and Practice, 145, 1-12.
McGrath, A., & Jonker, A. (2024). What is data wrangling?. IBM. Retrieved from https://www.ibm.com/think/topics/data-wrangling
New York City Taxi and Limousine Commission. (n.d.). About TLC. Retrieved from https://www.nyc.gov/site/tlc/about/about-tlc.page
New York City Taxi and Limousine Commission. (n.d.). Congestion surcharge. Retrieved from https://www1.nyc.gov/site/tlc/about/congestion-surcharge.page
New York City Taxi and Limousine Commission. (n.d.). TLC trip record data. Retrieved from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Pandas Documentation. (2021). Pandas: A powerful data analysis and manipulation library for Python. Retrieved from https://pandas.pydata.org/docs/
Pratik. (2025). Transform your career: Build AI agents with our essential roadmap. Analytics Vidhya. Retrieved from https://www.analyticsvidhya.com/blog/2021/08/exploratory-data-analysis-and-visualization-techniques-in-data-science/#h-data-preparation
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147-177.
ScienceDirect. (n.d.). Mann-Whitney U test. Elsevier B.V. Retrieved from https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/mann-whitney-u-test
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
Van Rossum, G. (2001). PEP 8 -- Style Guide for Python Code. Retrieved from https://www.python.org/dev/peps/pep-0008/